[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330890#comment-17330890 ] Chen Qin edited comment on FLINK-10052 at 4/23/21, 4:13 PM: here is another exception we observed in another job, may or may not caused by this pr. {code:java} 2021-04-23 11:09:03,388 INFO org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_e26_1617655625710_8692_01_000115 because: ResourceManager leader changed to new address null 2021-04-23 11:09:03,391 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph- USER_AGGREGATE_STATE.user_signal_v2.SINK-async (200/360) (bf815073df08c3426bf41b63d74510fb) switched from RUNNING to FAILED on container_e26_1617655625710_8692_01_000115 @ .ec2.pin220.com (dataPort=46719). org.apache.flink.util.FlinkException: ResourceManager leader changed to new address null at org.apache.flink.runtime.taskexecutor.TaskExecutor.notifyOfNewResourceManagerLeader(TaskExecutor.java:1093) at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$800(TaskExecutor.java:173) at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerLeaderListener.lambda$notifyLeaderAddress$0(TaskExecutor.java:1816) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at akka.actor.Actor$class.aroundReceive(Actor.scala:539) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:227) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:612) at akka.actor.ActorCell.invoke(ActorCell.scala:581) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268) at akka.dispatch.Mailbox.run(Mailbox.scala:229) at akka.dispatch.Mailbox.exec(Mailbox.scala:241) at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} was (Author: foxss): here is another exception we observed in another job after apply this pr {code:java} 2021-04-23 11:09:03,388 INFO org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_e26_1617655625710_8692_01_000115 because: ResourceManager leader changed to new address null 2021-04-23 11:09:03,391 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph- USER_AGGREGATE_STATE.user_signal_v2.SINK-async (200/360) (bf815073df08c3426bf41b63d74510fb) switched from RUNNING to FAILED on container_e26_1617655625710_8692_01_000115 @ .ec2.pin220.com (dataPort=46719). org.apache.flink.util.FlinkException: ResourceManager leader changed to new address null at org.apache.flink.runtime.taskexecutor.TaskExecutor.notifyOfNewResourceManagerLeader(TaskExecutor.java:1093) at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$800(TaskExecutor.java:173) at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerLeaderListener.lambda$notifyLeaderAddress$0(TaskExecutor.java:1816) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at akka.actor.Actor$class.aroundRe
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330858#comment-17330858 ] Chen Qin edited comment on FLINK-10052 at 4/23/21, 4:04 PM: run load testing on pr, seems suspended message no longer trigger leadership lost and job restart. At same time, found following exception when job restarts caused by other user jar issue. {code:java} 2021-04-21 18:24:44,639 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor- Fatal error occurred in TaskExecutor akka.tcp://flink@xxx:33435/user/rpc/taskmanager_0. org.apache.flink.util.FlinkException: Unhandled error in ZooKeeperLeaderRetrievalService:Background operation retry gave up at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.unhandledError(ZooKeeperLeaderRetrievalService.java:208) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:713) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:709) at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100) at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:874) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:990) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:943) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:66) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:346) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.KeeperException.create(KeeperException.java:102) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:862) ... 10 more {code} was (Author: foxss): run load testing on pr, found following exception when job restarts. {code:java} 2021-04-21 18:24:44,639 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor- Fatal error occurred in TaskExecutor akka.tcp://flink@xxx:33435/user/rpc/taskmanager_0. org.apache.flink.util.FlinkException: Unhandled error in ZooKeeperLeaderRetrievalService:Background operation retry gave up at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.unhandledError(ZooKeeperLeaderRetrievalService.java:208) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:713) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:709) at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100) at org.apache.flink.shaded.curator4.org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30) at org.apache.flink.shaded.curator4.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:708) at org.apache.flink.shaded.curator4.org.apache.curator.framework.imp
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311547#comment-17311547 ] Andrew.D.lin edited comment on FLINK-10052 at 3/30/21, 2:21 PM: Note change here![FLINK-18677|https://issues.apache.org/jira/browse/FLINK-18677] I think it should not be increased logic here,[handleStateChange|#L152]],Here is just to add logs at the beginning. Connection State processing should be managed by [ConnectionStateErrorPolicy|https://github.com/chendonglin521/curator-1/blob/15a9f03f6f7b156806d05d0dd7ce6cfd3ef39c72/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateErrorPolicy.java#L27] (support session and Standard) was (Author: andrew_lin): Note change here!FLINK-18677 I think it should not be increased logic here,[handleStateChange|#L152]],Here is just to add logs at the beginning. Connection State processing should be managed by [ConnectionStateErrorPolicy|https://github.com/chendonglin521/curator-1/blob/15a9f03f6f7b156806d05d0dd7ce6cfd3ef39c72/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateErrorPolicy.java#L27] (support session and Standard) > Tolerate temporarily suspended ZooKeeper connections > > > Key: FLINK-10052 > URL: https://issues.apache.org/jira/browse/FLINK-10052 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1 >Reporter: Till Rohrmann >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.13.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This issue results from FLINK-10011 which uncovered a problem with Flink's HA > recovery and proposed the following solution to harden Flink: > The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator > recipe for leader election. The leader latch revokes leadership in case of a > suspended ZooKeeper connection. This can be premature in case that the system > can reconnect to ZooKeeper before its session expires. The effect of the lost > leadership is that all jobs will be canceled and directly restarted after > regaining the leadership. > Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper > connection, it would be better to wait until the ZooKeeper connection is > LOST. That way we would allow the system to reconnect and not lose the > leadership. This could be achievable by using Curator's {{LeaderSelector}} > instead of the {{LeaderLatch}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17311547#comment-17311547 ] Andrew.D.lin edited comment on FLINK-10052 at 3/30/21, 2:21 PM: Note change here!FLINK-18677 I think it should not be increased logic here,[handleStateChange|#L152]],Here is just to add logs at the beginning. Connection State processing should be managed by [ConnectionStateErrorPolicy|https://github.com/chendonglin521/curator-1/blob/15a9f03f6f7b156806d05d0dd7ce6cfd3ef39c72/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateErrorPolicy.java#L27] (support session and Standard) was (Author: andrew_lin): Note change here![FLINK-18677|https://issues.apache.org/jira/browse/FLINK-18677] I think it should not be increased logic here,[handleStateChange|[https://github.com/apache/flink/blob/940dfb0deccb31e0ca576b4c044cbf588e0765dd/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalDriver.java#L152]|https://github.com/apache/flink/blob/940dfb0deccb31e0ca576b4c044cbf588e0765dd/flink-runtime/src/main/java/org/apache/flink/runtime/leaderretrieval/ZooKeeperLeaderRetrievalDriver.java#L152],]Here is just to add logs at the beginning. Connection State processing should be managed by[ConnectionStateErrorPolicy|https://github.com/chendonglin521/curator-1/blob/15a9f03f6f7b156806d05d0dd7ce6cfd3ef39c72/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateErrorPolicy.java#L27] (support session and Standard) > Tolerate temporarily suspended ZooKeeper connections > > > Key: FLINK-10052 > URL: https://issues.apache.org/jira/browse/FLINK-10052 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1 >Reporter: Till Rohrmann >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.13.0 > > Time Spent: 50m > Remaining Estimate: 0h > > This issue results from FLINK-10011 which uncovered a problem with Flink's HA > recovery and proposed the following solution to harden Flink: > The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator > recipe for leader election. The leader latch revokes leadership in case of a > suspended ZooKeeper connection. This can be premature in case that the system > can reconnect to ZooKeeper before its session expires. The effect of the lost > leadership is that all jobs will be canceled and directly restarted after > regaining the leadership. > Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper > connection, it would be better to wait until the ZooKeeper connection is > LOST. That way we would allow the system to reconnect and not lose the > leadership. This could be achievable by using Curator's {{LeaderSelector}} > instead of the {{LeaderLatch}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887268#comment-16887268 ] lamber-ken edited comment on FLINK-10052 at 7/17/19 5:28 PM: - [~Tison] yes, I checke the shaded class file by +javap -v+ command. The main thing is that maven-shaded-plugin help us relocate +org.apache.zookeeper.ClientCnxn$EventThread+ I'll update PR#9066. It's welcome if you can help me to review the pr. was (Author: lamber-ken): [~Tison] yes, I checke the shaded class file by +javap -v+ command, I'll update PR#9066. It's welcome if you can help me to review the pr. > Tolerate temporarily suspended ZooKeeper connections > > > Key: FLINK-10052 > URL: https://issues.apache.org/jira/browse/FLINK-10052 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1 >Reporter: Till Rohrmann >Assignee: Dominik Wosiński >Priority: Major > > This issue results from FLINK-10011 which uncovered a problem with Flink's HA > recovery and proposed the following solution to harden Flink: > The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator > recipe for leader election. The leader latch revokes leadership in case of a > suspended ZooKeeper connection. This can be premature in case that the system > can reconnect to ZooKeeper before its session expires. The effect of the lost > leadership is that all jobs will be canceled and directly restarted after > regaining the leadership. > Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper > connection, it would be better to wait until the ZooKeeper connection is > LOST. That way we would allow the system to reconnect and not lose the > leadership. This could be achievable by using Curator's {{LeaderSelector}} > instead of the {{LeaderLatch}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886699#comment-16886699 ] lamber-ken edited comment on FLINK-10052 at 7/17/19 6:01 AM: - [~Tison] (y), I have some points need to talk with you. First, for your first point, I thought it yesterday and wont create a new curator Jira like your CURATOR-532 that user can manually config ZooKeeper3.4.x Compatibility, but I give up that idea, because I found that it also needs to reflect +org.apache.zookeeper.ClientCnxn$EventThread+ which may throw ClassNotFoundException because of shading. Click it for more detail [InjectSessionExpiration|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/InjectSessionExpiration.java]. Second, for your second point, I am not familiar with LeaderSeclector currently and I'm learning about it. I also think it is a ideally way we can just use SessionConnectionStateErrorPolicy directly in curator-4.x Third, I don't understand the meaning of a flink scope leader latch was (Author: lamber-ken): [~Tison] (y), I have some points need to talk with you. First, for your first point, I thought it yesterday and wont create a new curator Jira like your CURATOR-532 that use can manually config ZooKeeper3.4.x Compatibility, but I give up that idea, because I found that it also needs to reflect +org.apache.zookeeper.ClientCnxn$EventThread+ which may throw ClassNotFoundException because of shading. Click it for more detail [InjectSessionExpiration|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/InjectSessionExpiration.java]. Second, for your second point, I am not familiar with LeaderSeclector currently and I'm learning about it. I also think it is a ideally way we can just use SessionConnectionStateErrorPolicy directly in curator-4.x Third, I don't understand the meaning of a flink scope leader latch > Tolerate temporarily suspended ZooKeeper connections > > > Key: FLINK-10052 > URL: https://issues.apache.org/jira/browse/FLINK-10052 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1 >Reporter: Till Rohrmann >Assignee: Dominik Wosiński >Priority: Major > > This issue results from FLINK-10011 which uncovered a problem with Flink's HA > recovery and proposed the following solution to harden Flink: > The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator > recipe for leader election. The leader latch revokes leadership in case of a > suspended ZooKeeper connection. This can be premature in case that the system > can reconnect to ZooKeeper before its session expires. The effect of the lost > leadership is that all jobs will be canceled and directly restarted after > regaining the leadership. > Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper > connection, it would be better to wait until the ZooKeeper connection is > LOST. That way we would allow the system to reconnect and not lose the > leadership. This could be achievable by using Curator's {{LeaderSelector}} > instead of the {{LeaderLatch}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16886663#comment-16886663 ] lamber-ken edited comment on FLINK-10052 at 7/17/19 4:06 AM: - [~Tison], right, it's a better way to upgrate curator dependcy to fix this ideally, but there's a problem that curator-4.x detect the version of zookeeper by test whether +org.apache.zookeeper.admin.ZooKeeperAdmin+ is in classpath or not, like bellow. {code:java} Class.forName("org.apache.admin.ZooKeeperAdmin"); {code} But flink-runtime module shades +org.apache.zookeeper+ to +org.apache.flink.shaded.zookeeper.org.apache.zookeeper+ , so it'll detect failed. I think two ways to fix this issue, First, rewrite +LeaderLatch#handleStateChange+ at flink-shaded-curator moduleflink, like [PR#9066|https://github.com/apache/flink/pull/9066]. Second, it also could be achievable by using Curator's LeaderSelector instead of the LeaderLatch as mentioned in issue description was (Author: lamber-ken): [~Tison], right, it's a better way to upgrate curator dependcy to fix this ideally, but there's a problem that curator-4.x detect the version of zookeeper by test whether +org.apache.zookeeper.admin.ZooKeeperAdmin+ is in classpath or not, like bellow. {code:java} Class.forName("org.apache.admin.ZooKeeperAdmin"); {code} But flink-runtime module shades +org.apache.zookeeper+ to +org.apache.flink.shaded.zookeeper.org.apache.zookeeper+ , so it'll detect failed. I think two ways to fix this issue, First, rewrite +LeaderLatch#handleStateChange+ at flink-shaded-curator moduleflink, like [PR#9066|https://github.com/apache/flink/pull/9066]. Seconde, it also could be achievable by using Curator's LeaderSelector instead of the LeaderLatch as mentioned in issue description > Tolerate temporarily suspended ZooKeeper connections > > > Key: FLINK-10052 > URL: https://issues.apache.org/jira/browse/FLINK-10052 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1 >Reporter: Till Rohrmann >Assignee: Dominik Wosiński >Priority: Major > > This issue results from FLINK-10011 which uncovered a problem with Flink's HA > recovery and proposed the following solution to harden Flink: > The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator > recipe for leader election. The leader latch revokes leadership in case of a > suspended ZooKeeper connection. This can be premature in case that the system > can reconnect to ZooKeeper before its session expires. The effect of the lost > leadership is that all jobs will be canceled and directly restarted after > regaining the leadership. > Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper > connection, it would be better to wait until the ZooKeeper connection is > LOST. That way we would allow the system to reconnect and not lose the > leadership. This could be achievable by using Curator's {{LeaderSelector}} > instead of the {{LeaderLatch}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884567#comment-16884567 ] lamber-ken edited comment on FLINK-10052 at 7/14/19 5:30 AM: - Hi, all. BTW, if we upgrade curator dependency to 4.x, there's a problem that curator-4.x detect the version of zookeeper by test whether +org.apache.zookeeper.admin.ZooKeeperAdmin+ is in classpath or not. But flink-runtime module shades +org.apache.zookeeper+ to +org.apache.flink.shaded.zookeeper.org.apache.zookeeper.+ So it will not fix this issue by upgrading curator's version. Here is [Curator Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java]. was (Author: lamber-ken): Hi, all. BTW, if we upgrade curator dependency to 4.x, there's a problem that curator-4.x detect the version of zookeeper by test whether +org.apache.zookeeper.admin.ZooKeeperAdmin+ is in classpath or not. But flink-runtime module will shade +org.apache.zookeeper+ to +org.apache.flink.shaded.zookeeper.org.apache.zookeeper.+ So it will not fix this issue by upgrading curator's version. Here is [Curator Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java]. > Tolerate temporarily suspended ZooKeeper connections > > > Key: FLINK-10052 > URL: https://issues.apache.org/jira/browse/FLINK-10052 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1 >Reporter: Till Rohrmann >Assignee: Dominik Wosiński >Priority: Major > > This issue results from FLINK-10011 which uncovered a problem with Flink's HA > recovery and proposed the following solution to harden Flink: > The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator > recipe for leader election. The leader latch revokes leadership in case of a > suspended ZooKeeper connection. This can be premature in case that the system > can reconnect to ZooKeeper before its session expires. The effect of the lost > leadership is that all jobs will be canceled and directly restarted after > regaining the leadership. > Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper > connection, it would be better to wait until the ZooKeeper connection is > LOST. That way we would allow the system to reconnect and not lose the > leadership. This could be achievable by using Curator's {{LeaderSelector}} > instead of the {{LeaderLatch}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884567#comment-16884567 ] lamber-ken edited comment on FLINK-10052 at 7/14/19 5:29 AM: - Hi, all. BTW, if we upgrade curator dependency to 4.x, there's a problem that curator-4.x detect the version of zookeeper by test whether +org.apache.zookeeper.admin.ZooKeeperAdmin+ is in classpath or not. But flink-runtime module will shade +org.apache.zookeeper+ to +org.apache.flink.shaded.zookeeper.org.apache.zookeeper.+ So it will not fix this issue by upgrading curator's version. Here is [Curator Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java]. was (Author: lamber-ken): Hi, all. BTW, if we upgrade curator dependency to 4.x, there's a problem that curator-4.x detect the version of zookeeper by test whether +org.apache.zookeeper.admin.ZooKeeperAdmin+ is in classpath or not. But flink-runtime module will shade +org.apache.zookeeper+ to +org.apache.flink.shaded.zookeeper.org.apache.zookeeper.+ So it will not fix this issue by upgrading curator's version. Here is [Curator Compatibility|https://github.com/apache/curator/blob/master/curator-client/src/main/java/org/apache/curator/utils/Compatibility.java]. > Tolerate temporarily suspended ZooKeeper connections > > > Key: FLINK-10052 > URL: https://issues.apache.org/jira/browse/FLINK-10052 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1 >Reporter: Till Rohrmann >Assignee: Dominik Wosiński >Priority: Major > > This issue results from FLINK-10011 which uncovered a problem with Flink's HA > recovery and proposed the following solution to harden Flink: > The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator > recipe for leader election. The leader latch revokes leadership in case of a > suspended ZooKeeper connection. This can be premature in case that the system > can reconnect to ZooKeeper before its session expires. The effect of the lost > leadership is that all jobs will be canceled and directly restarted after > regaining the leadership. > Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper > connection, it would be better to wait until the ZooKeeper connection is > LOST. That way we would allow the system to reconnect and not lose the > leadership. This could be achievable by using Curator's {{LeaderSelector}} > instead of the {{LeaderLatch}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882303#comment-16882303 ] lamber-ken edited comment on FLINK-10052 at 7/10/19 5:46 PM: - Thanks for remind me [~elevy] that I created a duplicated jira [FLINK-13189|https://issues.apache.org/jira/browse/FLINK-13189], I think may nobody solved this issuse before because of latest code of flink in github. I solve this problem by rewrite +LeaderLatch#handleStateChange+ at flink-shaded-curator module, and any suggestion will be welcome, thanks. was (Author: lamber-ken): Hi all, I'm very sorry that I created a duplicated jira [FLINK-13189|https://issues.apache.org/jira/browse/FLINK-13189], I think may nobody solved this issuse before because of latest code of flink in github. I solve this problem by rewrite +LeaderLatch#handleStateChange+ at flink-shaded-curator module, and any suggestion will be welcome, thanks. > Tolerate temporarily suspended ZooKeeper connections > > > Key: FLINK-10052 > URL: https://issues.apache.org/jira/browse/FLINK-10052 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.4.2, 1.5.2, 1.6.0 >Reporter: Till Rohrmann >Assignee: Dominik Wosiński >Priority: Major > > This issue results from FLINK-10011 which uncovered a problem with Flink's HA > recovery and proposed the following solution to harden Flink: > The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator > recipe for leader election. The leader latch revokes leadership in case of a > suspended ZooKeeper connection. This can be premature in case that the system > can reconnect to ZooKeeper before its session expires. The effect of the lost > leadership is that all jobs will be canceled and directly restarted after > regaining the leadership. > Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper > connection, it would be better to wait until the ZooKeeper connection is > LOST. That way we would allow the system to reconnect and not lose the > leadership. This could be achievable by using Curator's {{LeaderSelector}} > instead of the {{LeaderLatch}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605551#comment-16605551 ] Dominik Wosiński edited comment on FLINK-10052 at 9/6/18 9:48 AM: -- This may be also possibly related to : [FLINK-5996|https://issues.apache.org/jira/browse/FLINK-5996] was (Author: wosinsan): https://issues.apache.org/jira/browse/FLINK-5996 > Tolerate temporarily suspended ZooKeeper connections > > > Key: FLINK-10052 > URL: https://issues.apache.org/jira/browse/FLINK-10052 > Project: Flink > Issue Type: Improvement > Components: Distributed Coordination >Affects Versions: 1.4.2, 1.5.2, 1.6.0 >Reporter: Till Rohrmann >Assignee: Dominik Wosiński >Priority: Major > > This issue results from FLINK-10011 which uncovered a problem with Flink's HA > recovery and proposed the following solution to harden Flink: > The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator > recipe for leader election. The leader latch revokes leadership in case of a > suspended ZooKeeper connection. This can be premature in case that the system > can reconnect to ZooKeeper before its session expires. The effect of the lost > leadership is that all jobs will be canceled and directly restarted after > regaining the leadership. > Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper > connection, it would be better to wait until the ZooKeeper connection is > LOST. That way we would allow the system to reconnect and not lose the > leadership. This could be achievable by using Curator's {{LeaderSelector}} > instead of the {{LeaderLatch}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605551#comment-16605551 ] Chesnay Schepler edited comment on FLINK-10052 at 9/6/18 9:48 AM: -- https://issues.apache.org/jira/browse/FLINK-5996 was (Author: wosinsan): [#https://issues.apache.org/jira/browse/FLINK-5996] > Tolerate temporarily suspended ZooKeeper connections > > > Key: FLINK-10052 > URL: https://issues.apache.org/jira/browse/FLINK-10052 > Project: Flink > Issue Type: Improvement > Components: Distributed Coordination >Affects Versions: 1.4.2, 1.5.2, 1.6.0 >Reporter: Till Rohrmann >Assignee: Dominik Wosiński >Priority: Major > > This issue results from FLINK-10011 which uncovered a problem with Flink's HA > recovery and proposed the following solution to harden Flink: > The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator > recipe for leader election. The leader latch revokes leadership in case of a > suspended ZooKeeper connection. This can be premature in case that the system > can reconnect to ZooKeeper before its session expires. The effect of the lost > leadership is that all jobs will be canceled and directly restarted after > regaining the leadership. > Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper > connection, it would be better to wait until the ZooKeeper connection is > LOST. That way we would allow the system to reconnect and not lose the > leadership. This could be achievable by using Curator's {{LeaderSelector}} > instead of the {{LeaderLatch}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568832#comment-16568832 ] Elias Levy edited comment on FLINK-10052 at 8/3/18 9:46 PM: [~till.rohrmann] as I mentioned in FLINK-10011, it may not be necessary to replace the {{LeaderLatch}} Curator recipe to avoid loosing leadership during temporary connection failures. The Curator error handling [documentation|https://curator.apache.org/errors.html] talks about a {{SessionConnectionStateErrorPolicy}} that treats {{SUSPENDED}} and {{LOST}} connection states differently. And this [test|https://github.com/apache/curator/blob/d502dde1c4601b2abc6d831d764561a73316bf00/curator-recipes/src/test/java/org/apache/curator/framework/recipes/leader/TestLeaderLatch.java#L72-L146] shows how leadership is not lost with a {{LeaderLatch}} and that policy. The [code|https://github.com/apache/curator/blob/ed3082ecfebc332ba96da7a5bda4508a1985db6e/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L625-L631] implementing the policy. And [this shows|https://github.com/apache/curator/blob/5920c744508afd678a20309313e1b8d78baac0c4/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateManager.java#L298-L314] that Curator will inject a session expiration even while it is in {{SUSPENDED}} state, so that a disconnected client won't continue to think it is leader past its session expiration. So it is possible that all we need to do is call {{connectionStateErrorPolicy(new SessionConnectionStateErrorPolicy())}} in the {{CuratorFrameworkFactory}}. was (Author: elevy): [~till.rohrmann] as I mentioned in FLINK-10011, it may not be necessary to replace the {{LeaderLatch}} Curator recipe to avoid loosing leadership during temporary connection failures. The Curator error handling [documentation|https://curator.apache.org/errors.html] talks about a {{SessionConnectionStateErrorPolicy}} that treats {{SUSPENDED }}and {{LOST}} connection states differently. And this [test|https://github.com/apache/curator/blob/d502dde1c4601b2abc6d831d764561a73316bf00/curator-recipes/src/test/java/org/apache/curator/framework/recipes/leader/TestLeaderLatch.java#L72-L146] shows how leadership is not lost with a {{LeaderLatch}} and that policy. The [code|https://github.com/apache/curator/blob/ed3082ecfebc332ba96da7a5bda4508a1985db6e/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L625-L631] implementing the policy. And [this shows|https://github.com/apache/curator/blob/5920c744508afd678a20309313e1b8d78baac0c4/curator-framework/src/main/java/org/apache/curator/framework/state/ConnectionStateManager.java#L298-L314] that Curator will inject a session expiration even while it is in {{SUSPENDED }}state, so that a disconnected client won't continue to think it is leader past its session expiration. So it is possible that all we need to do is call {{connectionStateErrorPolicy(new SessionConnectionStateErrorPolicy())}} in the {{CuratorFrameworkFactory}}. > Tolerate temporarily suspended ZooKeeper connections > > > Key: FLINK-10052 > URL: https://issues.apache.org/jira/browse/FLINK-10052 > Project: Flink > Issue Type: Improvement > Components: Distributed Coordination >Affects Versions: 1.4.2, 1.5.2, 1.6.0 >Reporter: Till Rohrmann >Priority: Major > > This issue results from FLINK-10011 which uncovered a problem with Flink's HA > recovery and proposed the following solution to harden Flink: > The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator > recipe for leader election. The leader latch revokes leadership in case of a > suspended ZooKeeper connection. This can be premature in case that the system > can reconnect to ZooKeeper before its session expires. The effect of the lost > leadership is that all jobs will be canceled and directly restarted after > regaining the leadership. > Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper > connection, it would be better to wait until the ZooKeeper connection is > LOST. That way we would allow the system to reconnect and not lose the > leadership. This could be achievable by using Curator's {{LeaderSelector}} > instead of the {{LeaderLatch}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)