[jira] [Commented] (FLINK-7278) Flink job can stuck while ZK leader reelected during ZK cluster migration

2017-08-09 Thread Zhenzhong Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120845#comment-16120845
 ] 

Zhenzhong Xu commented on FLINK-7278:
-

[~trohrm...@apache.org] Unfortunately, we don't have logs for this instance any 
more. However, I think this is the only occurrence we have observed. Let me 
know if there is any other information I can help with.

> Flink job can stuck while ZK leader reelected during ZK cluster migration 
> --
>
> Key: FLINK-7278
> URL: https://issues.apache.org/jira/browse/FLINK-7278
> Project: Flink
>  Issue Type: Bug
>  Components: Distributed Coordination
>Reporter: Zhenzhong Xu
>Priority: Minor
>
> We have observed a potential failure case while a Flink job was running during 
> a ZK migration. The scenario is described below:
> 1. Flink cluster running in standalone mode on the Netflix Titus container runtime.
> 2. We performed a ZK migration by rolling out a new OS image one node at a time.
> 3. During the ZK leader reelection, the Flink cluster started to exhibit failures and eventually ended in a non-recoverable failure mode.
> 4. This behavior does not reproduce every time; it may be caused by an edge-case race condition.
> Below is a list of error messages ordered by event time:
> 2017-07-22 02:47:44,535 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Source -> Sink: Sink (67/176) (0442d63c89809ad86f38874c845ba83f) switched from RUNNING to FAILED.
> java.lang.Exception: TaskManager was lost/killed: ResourceID{resourceId='f519795dfabcecfd7863ed587efdb398'} @ titus-123072-worker-3-39 (dataPort=46879)
> at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
> at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
> at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
> at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
> at org.apache.flink.runtime.instance.InstanceManager.unregisterAllTaskManagers(InstanceManager.java:234)
> at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:330)
> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
> at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
> at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
> at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:118)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
> at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2017-07-22 02:47:44,621 WARN com.netflix.spaas.runtime.FlinkJobManager - Discard message LeaderSessionMessage(7a247ad9-531b-4f27-877b-df41f9019431,Disconnect(0b300c04592b19750678259cd09fea95,java.lang.Exception: TaskManager akka://flink/user/taskmanager is disassociating)) because the expected leader session ID None did not equal the received leader session ID Some(7a247ad9-531b-4f27-877b-df41f9019431).
> Zhenzhong Xu added a comment - 07/26/2017 09:24 PM
> 2017-07-22 02:47:45,015 WARN netflix.spaas.shaded.org.apache.zookeeper.ClientCnxn - Session 0x2579bebfd265054 for server 100.83.64.121/100.83.64.121:2181, unexpected error, closing socket connection and attempting reconnect
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at
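
The "Discard message" WARN above is the telling symptom: once the JobManager's ZooKeeper leadership is revoked, its expected leader session ID becomes None, so any message still tagged with the old session ID (here the TaskManager's Disconnect) is dropped rather than handled. Below is a minimal, self-contained Scala sketch of that filtering idea; it only illustrates the behavior visible in the log and is not Flink's actual LeaderSessionMessageFilter implementation, and all names in it are made up for the example.

import java.util.UUID

object LeaderSessionFilterSketch {
  // Hypothetical stand-in for a leader-session-scoped message envelope.
  case class LeaderSessionMessage(leaderSessionID: UUID, message: Any)

  // Accept a message only if it carries the currently expected leader session ID;
  // otherwise log and drop it, mirroring the WARN line in the report above.
  def filter(expected: Option[UUID], incoming: LeaderSessionMessage): Option[Any] =
    expected match {
      case Some(id) if id == incoming.leaderSessionID => Some(incoming.message)
      case _ =>
        println(s"Discard message $incoming because the expected leader session ID " +
          s"$expected did not equal the received leader session ID Some(${incoming.leaderSessionID}).")
        None
    }

  def main(args: Array[String]): Unit = {
    // During the ZK migration the JobManager has lost leadership (expected = None),
    // so a Disconnect still tagged with the stale session ID is silently discarded.
    val staleId = UUID.fromString("7a247ad9-531b-4f27-877b-df41f9019431")
    filter(None, LeaderSessionMessage(staleId, "Disconnect(taskmanager is disassociating)"))
  }
}

If leadership is never reacquired cleanly after the reelection, messages dropped this way are simply lost, which would be consistent with the non-recoverable state described in the report.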

[jira] [Commented] (FLINK-7278) Flink job can stuck while ZK leader reelected during ZK cluster migration

2017-07-31 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-7278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107017#comment-16107017
 ] 

Till Rohrmann commented on FLINK-7278:
--

Hi [~zhenzhongxu], thanks for reporting the issue. Does the ZK migration mean 
that the new ZK cluster is available under the same address as the old one?

Would it be possible to share the full logs with us? If not, then maybe you can 
share them privately with me. Let me know what's possible.
