Hi team, I am seeing a weird issue where one of my TMs suddenly loses its connection to the JM. Once the job running on that TM is relocated to a new TM, the old TM is able to reconnect to the JM, and after a while the new TM running the same job goes through the same cycle. The troubled TMs are not guaranteed to reconnect to the JM within a reasonable time frame (say, minutes); sometimes it takes days before they reconnect successfully.
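In case it is relevant, these are the Akka death-watch / timeout options I am thinking of raising in flink-conf.yaml so that a short network or DNS hiccup does not immediately mark a TM as unreachable. The option names are taken from the Flink configuration docs; the values below are only guesses to illustrate the idea, not what I am currently running:

    # flink-conf.yaml -- sketch only, values are illustrative guesses
    # make the JM<->TM death-watch more tolerant of short outages
    akka.watch.heartbeat.interval: 10 s
    akka.watch.heartbeat.pause: 120 s
    # give RPC calls and the registration lookup more headroom
    akka.ask.timeout: 60 s
    akka.lookup.timeout: 30 s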
I am using Flink 1.3.2 on Kubernetes. Is this because of network congestion? Thanks!

===== Logs from JM ======

*2017-11-16 19:14:40,216 WARN akka.remote.RemoteWatcher* - Detected unreachable: [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]
2017-11-16 19:14:40,218 INFO org.apache.flink.runtime.jobmanager.JobManager - Task manager akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager terminated.
2017-11-16 19:14:40,219 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager was lost/killed: 50cae001c1d97e55889a6051319f4746 @ fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)
    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
    at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
    at akka.actor.ActorCell.invoke(ActorCell.scala:486)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
*2017-11-16 19:14:40,219 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state RUNNING to FAILING.
java.lang.Exception: TaskManager was lost/killed: 50cae001c1d97e55889a6051319f4746 @ fps-flink-taskmanager-701246518-rb4k6 (dataPort=43871)*
    at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
    at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
    at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
    at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
    at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
    at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
    at akka.actor.ActorCell.invoke(ActorCell.scala:486)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-11-16 19:14:40,227 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Try to restart or fail the job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) if no longer possible.
2017-11-16 19:14:40,227 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state FAILING to RESTARTING.
2017-11-16 19:14:40,227 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f).
2017-11-16 19:14:40,229 INFO org.apache.flink.runtime.instance.InstanceManager - Unregistered task manager fps-flink-taskmanager-701246518-rb4k6/10.225.131.3. Number of registered task managers 3. Number of available slots 3.
2017-11-16 19:14:40,301 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] has failed, address is now gated for [5000] ms.
Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]] Caused by: [fps-flink-taskmanager-701246518-rb4k6: Name does not resolve]
2017-11-16 19:14:50,228 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state RESTARTING to CREATED.
2017-11-16 19:14:50,228 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring from latest valid checkpoint: Checkpoint 646 @ 1510859465870 for 3de8f35a1af689237e3c5c94023aba3f.
2017-11-16 19:14:50,233 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - No master state to restore
2017-11-16 19:14:50,233 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job KafkaDemo (env:production) (3de8f35a1af689237e3c5c94023aba3f) switched from state CREATED to RUNNING.
2017-11-16 19:14:50,233 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (6ae48dd06a2a48dce68a3ea83c21a462) switched from CREATED to SCHEDULED.
2017-11-16 19:14:50,234 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (6ae48dd06a2a48dce68a3ea83c21a462) switched from SCHEDULED to DEPLOYING.
2017-11-16 19:14:50,234 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (attempt #1) to fps-flink-taskmanager-701246518-v9vjx
2017-11-16 19:14:50,244 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]] Caused by: [fps-flink-taskmanager-701246518-rb4k6]
2017-11-16 19:14:50,987 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (6ae48dd06a2a48dce68a3ea83c21a462) switched from DEPLOYING to RUNNING.
2017-11-16 19:15:24,480 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 649 @ 1510859724479
2017-11-16 19:15:25,638 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 649 (6724 bytes in 928 ms).
2017-11-16 19:16:22,205 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]
2017-11-16 19:16:50,233 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 648 @ 1510859810233
2017-11-16 19:17:12,972 INFO org.apache.flink.runtime.jobmanager.JobManager - Received message for non-existing checkpoint 647
2017-11-16 19:17:13,685 INFO org.apache.flink.runtime.instance.InstanceManager - Registering TaskManager at fps-flink-taskmanager-701246518-rb4k6/10.225.131.3 which was marked as dead earlier because of a heart-beat timeout.
2017-11-16 19:17:13,685 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at fps-flink-taskmanager-701246518-rb4k6 (akka.tcp://flink@fps-flink-taskmanager-701246518-rb4k6:46416/user/taskmanager) as 8905a72ca2655b9ed0bddb50ff3c49a9. Current number of registered hosts is 4. Current number of alive task slots is 4.
2017-11-16 19:17:24,480 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 650 @ 1510859844480

===== Logs from TM ======

2017-11-16 19:13:33,177 INFO org.apache.flink.runtime.state.DefaultOperatorStateBackend - DefaultOperatorStateBackend snapshot (File Stream Factory @ s3a://zendesk-usw2-fraud-prevention-production/checkpoints/3de8f35a1af689237e3c5c94023aba3f, synchronous part) in thread Thread[Async calls on Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1),5,Flink Task Threads] took 1 ms.
*2017-11-16 19:14:35,820 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@fps-flink-jobmanager:6123]*
2017-11-16 19:14:37,525 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager: JobManager is no longer reachable
2017-11-16 19:15:50,827 INFO org.apache.flink.runtime.taskmanager.TaskManager - Cancelling all computations and discarding all cached data.
2017-11-16 19:16:23,596 INFO com.amazonaws.http.AmazonHttpClient - Unable to execute HTTP request: The target server failed to respond
org.apache.http.NoHttpResponseException: The target server failed to respond
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:95)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
    at com.amazonaws.http.protocol.SdkHttpRequestExecutor.doReceiveResponse(SdkHttpRequestExecutor.java:66)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:715)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:520)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
    at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteUnnecessaryFakeDirectories(S3AFileSystem.java:1144)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.finishedWrite(S3AFileSystem.java:1133)
    at org.apache.hadoop.fs.s3a.S3AOutputStream.close(S3AOutputStream.java:142)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
    at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
    at org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream.close(HadoopDataOutputStream.java:48)
    at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
    at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:319)
    at org.apache.flink.runtime.checkpoint.AbstractAsyncSnapshotIOCallable.closeStreamAndGetStateHandle(AbstractAsyncSnapshotIOCallable.java:100)
    at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:270)
    at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:233)
    at org.apache.flink.runtime.io.async.AbstractAsyncIOCallable.call(AbstractAsyncIOCallable.java:72)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:40)
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:906)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
2017-11-16 19:16:35,816 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c).
2017-11-16 19:16:35,819 WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@fps-flink-jobmanager:6123]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has a UID that has been quarantined. Association aborted.]
*2017-11-16 19:16:45,817 INFO org.apache.flink.runtime.taskmanager.Task - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager: JobManager is no longer reachable*
    at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1095)
    at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:311)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:120)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
    at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
    at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
    at akka.actor.ActorCell.invoke(ActorCell.scala:486)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-11-16 19:16:55,815 INFO org.apache.flink.runtime.state.DefaultOperatorStateBackend - DefaultOperatorStateBackend snapshot (File Stream Factory @ s3a://zendesk-usw2-fraud-prevention-production/checkpoints/3de8f35a1af689237e3c5c94023aba3f, asynchronous part) in thread Thread[pool-7-thread-647,5,Flink Task Threads] took 202637 ms.
2017-11-16 19:16:55,816 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c).
2017-11-16 19:17:00,825 INFO org.apache.flink.runtime.taskmanager.TaskManager - Disassociating from JobManager
2017-11-16 19:17:00,825 INFO org.apache.flink.runtime.blob.BlobCache - Shutting down BlobCache
2017-11-16 19:17:13,677 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
*2017-11-16 19:17:13,687 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@fps-flink-jobmanager:6123/user/jobmanager), starting network stack and library cache.
2017-11-16 19:17:13,687 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be fps-flink-jobmanager/10.231.37.240:6124. Starting BLOB cache.
2017-11-16 19:17:13,687 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-133033d7-a3c1-4ffb-b360-3d0efdc355aa*
2017-11-16 19:17:31,097 WARN org.apache.flink.runtime.taskmanager.Task - Task 'Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to cancelling signal, but is stuck in method:
    java.util.regex.Pattern.matcher(Pattern.java:1093)
    java.lang.String.replace(String.java:2239)
    com.trueaccord.scalapb.textformat.TextFormatUtils$.escapeDoubleQuotesAndBackslashes(TextFormatUtils.scala:160)
    com.trueaccord.scalapb.textformat.TextGenerator.addMaybeEscape(TextGenerator.scala:29)
    com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:74)
    com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:41)
    com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
    com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
    scala.collection.Iterator$class.foreach(Iterator.scala:742)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
    scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
    com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
    com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
    com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
    com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
    com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
    java.lang.String.valueOf(String.java:2994)
    scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
    scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:362)
    scala.collection.immutable.List.foreach(List.scala:381)
    scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
    scala.collection.AbstractTraversable.addString(Traversable.scala:104)
    scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
    scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
    scala.collection.TraversableLike$class.toString(TraversableLike.scala:645)
    scala.collection.SeqLike$class.toString(SeqLike.scala:682)
    scala.collection.AbstractSeq.toString(Seq.scala:41)
    java.lang.String.valueOf(String.java:2994)
    java.lang.StringBuilder.append(StringBuilder.java:131)
    scala.StringContext.standardInterpolator(StringContext.scala:125)
    scala.StringContext.s(StringContext.scala:95)
    com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:12)
    com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
    org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
    org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
    org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
    org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
    org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
    org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
    org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
    org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
    org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
    org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
    org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
    org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
    org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
    org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
    org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
    org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
    org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
    java.lang.Thread.run(Thread.java:748)
2017-11-16 19:17:31,106 WARN org.apache.flink.runtime.taskmanager.Task - Task 'Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to cancelling signal, but is stuck in method:
    com.trueaccord.scalapb.textformat.TextGenerator.add(TextGenerator.scala:19)
    com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:33)
    com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
    com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
    scala.collection.Iterator$class.foreach(Iterator.scala:742)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
    scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
    com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
    com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
    com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
    com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
    com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
    java.lang.String.valueOf(String.java:2994)
    scala.collection.mutable.StringBuilder.append(StringBuilder.scala:200)
    scala.collection.TraversableOnce$$anonfun$addString$1.apply(TraversableOnce.scala:362)
    scala.collection.immutable.List.foreach(List.scala:381)
    scala.collection.TraversableOnce$class.addString(TraversableOnce.scala:355)
    scala.collection.AbstractTraversable.addString(Traversable.scala:104)
    scala.collection.TraversableOnce$class.mkString(TraversableOnce.scala:321)
    scala.collection.AbstractTraversable.mkString(Traversable.scala:104)
    scala.collection.TraversableLike$class.toString(TraversableLike.scala:645)
    scala.collection.SeqLike$class.toString(SeqLike.scala:682)
    scala.collection.AbstractSeq.toString(Seq.scala:41)
    java.lang.String.valueOf(String.java:2994)
    java.lang.StringBuilder.append(StringBuilder.java:131)
    scala.StringContext.standardInterpolator(StringContext.scala:125)
    scala.StringContext.s(StringContext.scala:95)
    com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:12)
    com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
    org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
    org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
    org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
    org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
    org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
    org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
    org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
    org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
    org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
    org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
    org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
    org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
    org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
    org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
    org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
    org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
    org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
    java.lang.Thread.run(Thread.java:748)
2017-11-16 19:18:05,486 WARN org.apache.flink.runtime.taskmanager.Task - Task 'Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1)' did not react to cancelling signal, but is stuck in method:
    java.lang.String.replace(String.java:2240)
    com.trueaccord.scalapb.textformat.TextGenerator.addMaybeEscape(TextGenerator.scala:29)
    com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:74)
    com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:41)
    com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
    com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$printField$1.apply(Printer.scala:25)
    scala.collection.Iterator$class.foreach(Iterator.scala:742)
    scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
    scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:25)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    com.trueaccord.scalapb.textformat.Printer$.printFieldValue(Printer.scala:70)
    com.trueaccord.scalapb.textformat.Printer$.printSingleField(Printer.scala:37)
    com.trueaccord.scalapb.textformat.Printer$.printField(Printer.scala:28)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:13)
    com.trueaccord.scalapb.textformat.Printer$$anonfun$print$2.apply(Printer.scala:12)
    scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:12)
    com.trueaccord.scalapb.textformat.Printer$.print(Printer.scala:8)
    com.trueaccord.scalapb.textformat.Printer$.printToString(Printer.scala:19)
    com.trueaccord.scalapb.TextFormat$.printToUnicodeString(TextFormat.scala:34)
    com.zendesk.fraud_prevention.data_types.MaxwellEvent.toString(MaxwellEvent.scala:212)
    java.lang.String.valueOf(String.java:2994)
    java.lang.StringBuilder.append(StringBuilder.java:131)
    scala.StringContext.standardInterpolator(StringContext.scala:125)
    scala.StringContext.s(StringContext.scala:95)
    com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper$$anonfun$flatMap$1.apply(MaxwellFilterFlatMapper.scala:14)
    com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper$$anonfun$flatMap$1.apply(MaxwellFilterFlatMapper.scala:13)
    scala.collection.immutable.List.foreach(List.scala:381)
    com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:13)
    com.zendesk.fraud_prevention.functions.MaxwellFilterFlatMapper.flatMap(MaxwellFilterFlatMapper.scala:10)
    org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
    org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:528)
    org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:503)
    org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:483)
    org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:891)
    org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:869)
    org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:304)
    org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:393)
    org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecord(AbstractFetcher.java:233)
    org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.emitRecord(Kafka09Fetcher.java:193)
    org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher.runFetchLoop(Kafka09Fetcher.java:152)
    org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:483)
    org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
    org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
    org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
    org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
    org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
    java.lang.Thread.run(Thread.java:748)
2017-11-16 19:18:34,375 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c).
2017-11-16 19:18:36,257 INFO org.apache.flink.runtime.taskmanager.Task - Ensuring all FileSystem streams are closed for task Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) -> FixedDelayWatermark(maxwell.tickets) -> MaxwellFPSEvent->InfluxDBData(maxwell.tickets) (1/1) (484ebabbd13dce5e8503d88005bcdb6c) [FAILED]
2017-11-16 19:18:36,618 ERROR org.apache.flink.runtime.taskmanager.TaskManager - Cannot find task with ID 484ebabbd13dce5e8503d88005bcdb6c to unregister.
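One more thing I noticed while collecting these logs: every "did not react to cancelling signal" dump shows the task busy inside MaxwellEvent.toString (ScalaPB text-format printing) triggered by a string interpolation in my MaxwellFilterFlatMapper, which is probably why cancellation drags on for minutes. As a side experiment I plan to stop rendering the whole event eagerly; a minimal sketch of the idea (the logger usage, class, and field names below are hypothetical, not my exact code):

    import org.slf4j.{Logger, LoggerFactory}

    // Hypothetical stand-in for the generated ScalaPB class; the real toString
    // goes through TextFormat.printToUnicodeString and is expensive on big rows.
    final case class MaxwellEvent(table: String, payload: String)

    object MaxwellFilterSketch {
      private val log: Logger = LoggerFactory.getLogger(getClass)

      def handle(event: MaxwellEvent): Unit = {
        // Before (roughly what the stack dumps point at): an s"..." interpolation
        // that always forces event.toString, even when the message is never used.
        //   log.debug(s"dropping event: $event")

        // After: only render the event when debug logging is actually enabled.
        if (log.isDebugEnabled) {
          log.debug(s"dropping event: $event")
        }
      }
    }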