Hi team, I see an wired issue that one of my TM suddenly lost connection to
JM.
Once the job running on the TM relocated to a new TM, it can reconnect to
JM again.
And after a while, the new TM running the same job will repeat the same
process.
It is not guaranteed the troubled TMs can reconnect to JM in a reasonable
time frame, like minutes. Sometime it take days in order to reconnect
successfully.

I am using Flink 1.3.2 and Kubernetes. Is this because of network
congestion?

Thanks!

===== Logs from JM ======

*2017-11-16 19:14:40,216 WARN  akka.remote.RemoteWatcher*
                       - Detected unreachable:
[akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]
2017-11-16 19:14:40,218 INFO
org.apache.flink.runtime.jobmanager.JobManager                - Task
manager 
akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager
terminated.
2017-11-16 19:14:40,219 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager was lost/killed:
50cae001c1d97e55889a6051319f4746 @
fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)
        at 
org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
        at 
org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
        at 
org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
        at 
org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
        at 
org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
        at 
org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
        at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at 
org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
        at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at 
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
        at 
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
        at 
org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at 
org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at 
akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
        at akka.actor.ActorCell.invoke(ActorCell.scala:486)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)*2017-11-16
19:14:40,219 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched
from state RUNNING to FAILING.
java.lang.Exception: TaskManager was lost/killed:
50cae001c1d97e55889a6051319f4746 @
fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)*
        at 
org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
        at 
org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
        at 
org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
        at 
org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
        at 
org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
        at 
org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
        at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
        at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at 
org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
        at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at 
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
        at 
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
        at 
org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at 
org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at 
akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
        at akka.actor.ActorCell.invoke(ActorCell.scala:486)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-11-16 19:14:40,227 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to
restart or fail the job KafkaDemo (env:production)
(3de8f35a1af689237e3c5c94023aba3f) if no longer possible.
2017-11-16 19:14:40,227 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched
from state FAILING to RESTARTING.
2017-11-16 19:14:40,227 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Restarting the job KafkaDemo (env:production)
(3de8f35a1af689237e3c5c94023aba3f).
2017-11-16 19:14:40,229 INFO
org.apache.flink.runtime.instance.InstanceManager             -
Unregistered task manager
fps-flink-taskmanager-701246518-rb4k6/10.225.131.3. Number of
registered task managers 3. Number of available slots 3.
2017-11-16 19:14:40,301 WARN  akka.remote.ReliableDeliverySupervisor
                     - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] has
failed, address is now gated for [5000] ms. Reason: [Association
failed with [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]]
Caused by: [fps-flink-taskmanager-701246518-rb4k6: Name does not
resolve]
2017-11-16 19:14:50,228 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched
from state RESTARTING to CREATED.
2017-11-16 19:14:50,228 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     -
Restoring from latest valid checkpoint: Checkpoint 646 @ 1510859465870
for 3de8f35a1af689237e3c5c94023aba3f.
2017-11-16 19:14:50,233 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - No
master state to restore
2017-11-16 19:14:50,233 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched
from state CREATED to RUNNING.
2017-11-16 19:14:50,233 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(6ae48dd06a2a48dce68a3ea83c21a462) switched from CREATED to SCHEDULED.
2017-11-16 19:14:50,234 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(6ae48dd06a2a48dce68a3ea83c21a462) switched from SCHEDULED to
DEPLOYING.
2017-11-16 19:14:50,234 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Deploying Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (attempt #1) to
fps-flink-taskmanager-701246518-v9vjx
2017-11-16 19:14:50,244 WARN  akka.remote.ReliableDeliverySupervisor
                     - Association with remote system
[akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] has
failed, address is now gated for [5000] ms. Reason: [Association
failed with [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]]
Caused by: [fps-flink-taskmanager-701246518-rb4k6]
2017-11-16 19:14:50,987 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(6ae48dd06a2a48dce68a3ea83c21a462) switched from DEPLOYING to RUNNING.
2017-11-16 19:15:24,480 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     -
Triggering checkpoint 649 @ 1510859724479
2017-11-16 19:15:25,638 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     -
Completed checkpoint 649 (6724 bytes in 928 ms).
2017-11-16 19:16:22,205 WARN  Remoting
                     - Tried to associate with unreachable remote
address [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416].
Address is now gated for 5000 ms, all messages to this address will be
delivered to dead letters. Reason: [The remote system has quarantined
this system. No further associations to the remote system are possible
until this system is restarted.]
2017-11-16 19:16:50,233 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     -
Triggering checkpoint 648 @ 1510859810233
2017-11-16 19:17:12,972 INFO
org.apache.flink.runtime.jobmanager.JobManager                -
Received message for non-existing checkpoint 647
2017-11-16 19:17:13,685 INFO
org.apache.flink.runtime.instance.InstanceManager             -
Registering TaskManager at
fps-flink-taskmanager-701246518-rb4k6/10.225.131.3 which was marked as
dead earlier because of a heart-beat timeout.
2017-11-16 19:17:13,685 INFO
org.apache.flink.runtime.instance.InstanceManager             -
Registered TaskManager at fps-flink-taskmanager-701246518-rb4k6
(akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager)
as 8905a72ca2655b9ed0bddb50ff3c49a9. Current number of registered
hosts is 4. Current number of alive task slots is 4.
2017-11-16 19:17:24,480 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     -
Triggering checkpoint 650 @ 1510859844480


===== Logs from TM ======

2017-11-16 19:13:33,177 INFO
org.apache.flink.runtime.state.DefaultOperatorStateBackend    -
DefaultOperatorStateBackend snapshot (File Stream Factory @
s3a://zendesk-usw2-fraud-prevention-production/checkpoints/3de8f35a1af689237e3c5c94023aba3f,
synchronous part) in thread Thread[Async calls on Source:
KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1),5,Flink Task
Threads] took 1 ms.*2017-11-16 19:14:35,820 WARN
akka.remote.RemoteWatcher                                     -
Detected unreachable: [akka.tcp://flink@fps-flink-jobmanager:6123]*
2017-11-16 19:14:37,525 INFO
org.apache.flink.runtime.taskmanager.TaskManager              -
TaskManager akka://flink/user/taskmanager disconnects from JobManager
akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager: JobManager
is no longer reachable
2017-11-16 19:15:50,827 INFO
org.apache.flink.runtime.taskmanager.TaskManager              -
Cancelling all computations and discarding all cached data.
2017-11-16 19:16:23,596 INFO  com.amazonaws.http.AmazonHttpClient
                     - Unable to execute HTTP request: The target
server failed to respond
org.apache.http.NoHttpResponseException: The target server failed to respond
        at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
        at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
        at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
        at 
org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
        at 
org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
        at 
org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
        at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
        at 
com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:66)
        at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
        at 
org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
        at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
        at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
        at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
        at 
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
        at 
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at 
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at 
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
        at 
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
        at 
org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
        at 
org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:1144)
        at 
org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:1133)
        at 
org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:142)
        at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
        at 
org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream.close(HadoopDataOutputStream.java:48)
        at 
org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
        at 
org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:319)
        at 
org.apache.flink.runtime.checkpoint.AbstractAsyncSnapshotIOCallable.closeStreamAndGetStateHandle(AbstractAsyncSnapshotIOCallable.java:100)
        at 
org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:270)
        at 
org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:233)
        at 
org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
        at 
org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:906)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
2017-11-16 19:16:35,816 INFO
org.apache.flink.runtime.taskmanager.Task                     -
Attempting to fail task externally Source:
KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c).
2017-11-16 19:16:35,819 WARN  Remoting
                     - Tried to associate with unreachable remote
address [akka.tcp://flink@fps-flink-jobmanager:6123]. Address is now
gated for 5000 ms, all messages to this address will be delivered to
dead letters. Reason: [The remote system has a UID that has been
quarantined. Association aborted.] *2017-11-16 19:16:45,817 INFO
org.apache.flink.runtime.taskmanager.Task                     -
Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager
disconnects from JobManager
akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager: JobManager
is no longer reachable*
        at 
org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1095)
        at 
org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:311)
        at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at 
org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
        at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
        at 
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
        at 
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
        at 
org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at 
org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:120)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at 
akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
        at akka.actor.ActorCell.invoke(ActorCell.scala:486)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-11-16 19:16:55,815 INFO
org.apache.flink.runtime.state.DefaultOperatorStateBackend    -
DefaultOperatorStateBackend snapshot (File Stream Factory @
s3a://zendesk-usw2-fraud-prevention-production/checkpoints/3de8f35a1af689237e3c5c94023aba3f,
asynchronous part) in thread Thread[pool-7-thread-647,5,Flink Task
Threads] took 202637 ms.
2017-11-16 19:16:55,816 INFO
org.apache.flink.runtime.taskmanager.Task                     -
Triggering cancellation of task code Source:
KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c).
2017-11-16 19:17:00,825 INFO
org.apache.flink.runtime.taskmanager.TaskManager              -
Disassociating from JobManager
2017-11-16 19:17:00,825 INFO  org.apache.flink.runtime.blob.BlobCache
                     - Shutting down BlobCache
2017-11-16 19:17:13,677 INFO
org.apache.flink.runtime.taskmanager.TaskManager              - Trying
to register at JobManager
akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager (attempt 1,
timeout: 500 milliseconds)*2017-11-16 19:17:13,687 INFO
org.apache.flink.runtime.taskmanager.TaskManager              -
Successful registration at JobManager
(akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager), starting
network stack and library cache.
2017-11-16 19:17:13,687 INFO
org.apache.flink.runtime.taskmanager.TaskManager              -
Determined BLOB server address to be
fps-flink-jobmanager/10.231.37.240:6124 <http://10.231.37.240:6124>.
Starting BLOB cache.
2017-11-16 19:17:13,687 INFO  org.apache.flink.runtime.blob.BlobCache
                     - Created BLOB cache storage directory
/tmp/blobStore-133033d7-a3c1-4ffb-b360-3d0efdc355aa*
2017-11-16 19:17:31,097 WARN
org.apache.flink.runtime.taskmanager.Task                     - Task
'Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to
cancelling signal, but is stuck in method:
 java.util.regex.Pattern.matcher(Pattern.java:1093)
java.lang.String.replace(String.java:2239)
com.trueaccord.scalapb.textformat.TextFormatUtils$.escapeDoubleQuotesAndBackslashes(TextFormatUtils.scala:160)
com.trueaccord.scalapb.textformat.TextGenerator.addMaybeEscape(TextGenerator.scala:29)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:74)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:41)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
scala.collection.Iterator$class.foreach(Iterator.scala:742)
scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
java.lang.String.valueOf(String.java:2994)
scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:362)
scala.collection.immutable.List.foreach(List.scala:381)
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
scala.collection.AbstractTraversable.addString(Traversable.scala:104)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
scala.collection.TraversableLike$class.toString(TraversableLike.scala:645)
scala.collection.SeqLike$class.toString(SeqLike.scala:682)
scala.collection.AbstractSeq.toString(Seq.scala:41)
java.lang.String.valueOf(String.java:2994)
java.lang.StringBuilder.append(StringBuilder.java:131)
scala.StringContext.standardInterpolator(StringContext.scala:125)
scala.StringContext.s(StringContext.scala:95)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:12)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
java.lang.Thread.run(Thread.java:748)

2017-11-16 19:17:31,106 WARN
org.apache.flink.runtime.taskmanager.Task                     - Task
'Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to
cancelling signal, but is stuck in method:
 com.trueaccord.scalapb.textformat.TextGenerator.add(TextGenerator.scala:19)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:33)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
scala.collection.Iterator$class.foreach(Iterator.scala:742)
scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
java.lang.String.valueOf(String.java:2994)
scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:362)
scala.collection.immutable.List.foreach(List.scala:381)
scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
scala.collection.AbstractTraversable.addString(Traversable.scala:104)
scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
scala.collection.TraversableLike$class.toString(TraversableLike.scala:645)
scala.collection.SeqLike$class.toString(SeqLike.scala:682)
scala.collection.AbstractSeq.toString(Seq.scala:41)
java.lang.String.valueOf(String.java:2994)
java.lang.StringBuilder.append(StringBuilder.java:131)
scala.StringContext.standardInterpolator(StringContext.scala:125)
scala.StringContext.s(StringContext.scala:95)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:12)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
java.lang.Thread.run(Thread.java:748)

2017-11-16 19:18:05,486 WARN
org.apache.flink.runtime.taskmanager.Task                     - Task
'Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to
cancelling signal, but is stuck in method:
 java.lang.String.replace(String.java:2240)
com.trueaccord.scalapb.textformat.TextGenerator.addMaybeEscape(TextGenerator.scala:29)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:74)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:41)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
scala.collection.Iterator$class.foreach(Iterator.scala:742)
scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
scala.collection.AbstractIterable.foreach(Iterable.scala:54)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
java.lang.String.valueOf(String.java:2994)
java.lang.StringBuilder.append(StringBuilder.java:131)
scala.StringContext.standardInterpolator(StringContext.scala:125)
scala.StringContext.s(StringContext.scala:95)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper$$anonfun$flatMap$1.apply(MaxwellFilterFlatMapper.scala:14)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper$$anonfun$flatMap$1.apply(MaxwellFilterFlatMapper.scala:13)
scala.collection.immutable.List.foreach(List.scala:381)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:13)
com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
java.lang.Thread.run(Thread.java:748)

2017-11-16 19:18:34,375 INFO
org.apache.flink.runtime.taskmanager.Task                     -
Freeing task resources for Source: KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c).
2017-11-16 19:18:36,257 INFO
org.apache.flink.runtime.taskmanager.Task                     -
Ensuring all FileSystem streams are closed for task Source:
KafkaSource(maxwell.tickets) ->
MaxwellFilter->Maxwell(maxwell.tickets) ->
FixedDelayWatermark(maxwell.tickets) ->
MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)
(484ebabbd13dce5e8503d88005bcdb6c) [FAILED]
2017-11-16 19:18:36,618 ERROR
org.apache.flink.runtime.taskmanager.TaskManager              - Cannot
find task with ID 484ebabbd13dce5e8503d88005bcdb6c to unregister.

Reply via email to