[jira] [Comment Edited] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208437#comment-17208437 ]

Steven Zhen Wu edited comment on FLINK-19401 at 10/6/20, 12:42 AM:
---
[~roman_khachatryan] I don't know if repeated checkpoint recovery is the root/main cause or not. [~trohrmann] identified two problems during his investigation. This is one of the identified problems. Regarding the logs related to Titus, please ignore them. They are just noise. We haven't cleaned up our logs yet.

> Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.10.1, 1.11.2
> Reporter: Steven Zhen Wu
> Priority: Critical
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
> The Flink job sometimes gets into a restart loop for many hours and can't recover until redeployed. We had an issue with Kafka that initially caused the job to restart.
> Below is the first of the many exceptions for the "ResourceManagerException: Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [flink-akka.actor.default-dispatcher-35973] - Requesting new slot [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [flink-akka.actor.default-dispatcher-35973] - Requesting new slot [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [flink-akka.actor.default-dispatcher-35973] - Requesting new slot [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl [flink-akka.actor.default-dispatcher-35973] - Requesting new slot [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [flink-akka.actor.default-dispatcher-35973] - Source: k2-csevpc -> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED to FAILED.
> java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
> 	at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> 	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> 	at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 	at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 	at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> 	at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> 	at org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> 	at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> 	at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> 	at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> 	at
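For anyone hitting similar slot-request timeouts, the behavior seen in these logs is governed by a few flink-conf.yaml settings. Below is a sketch for reference only; the values shown are Flink's documented defaults around 1.10/1.11, not a recommendation for this particular job:

```yaml
# How often JM/TM heartbeats are sent, and how long before a peer is
# considered dead (expiry triggers the heartbeat-timeout failovers above).
heartbeat.interval: 10000      # ms
heartbeat.timeout: 50000       # ms

# How long a pending slot request may wait before it fails with
# NoResourceAvailableException, as in the stack trace above.
slot.request.timeout: 300000   # ms
```

Tightening these without understanding the underlying network or registration problem would only change how fast the loop spins, not break it.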
[jira] [Comment Edited] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster
[ https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208335#comment-17208335 ]

Roman Khachatryan edited comment on FLINK-19401 at 10/5/20, 9:42 PM:
-
I do see that the same data is downloaded over and over, but I'm not sure whether it's the root cause. From the logs I see that heartbeats *on JM* from TMs expire first, and heartbeats *on TMs* from JM expire two minutes later:
{code:java}
flink-19401 $ grep -ah -B1 'Heartbeat of Task' *.log | grep '^2020' | cut -c-16 | sort | uniq -c
     66 2020-09-18 23:57
     11 2020-09-19 01:06

flink-19401 $ grep -ah -B1 'heartbeat of Job' *.log | grep '^2020' | cut -c-16 | sort | uniq -c
    509 2020-09-18 23:59
      2 2020-09-19 00:03
      2 2020-09-19 00:32
      2 2020-09-19 00:37
      2 2020-09-19 01:06
{code}
(77 vs 517 is due to the parallelism level, I think.)
Furthermore, the first 8 recoveries start after failures in Kafka: an NPE in NetworkClient and "SSLProtocolException: Handshake message sequence violation, 2". After that, they happen after Kafka failures and the aforementioned TM heartbeat timeouts on the JM. So it looks like the root cause is misconfiguration and/or the network. WDYT [~trohrmann], [~stevenz3wu]?
Steven, can you please also explain what Titus is doing here? (It looks like it is adding and then removing the task every minute - does that ring any bell?)
{code:java}
$ grep 72cc604b-dcd5-4c75-a466-0321d7c51c3e *.log
2020-09-18 23:55:43,804 - Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60067 ms.
2020-09-18 23:56:11,256 - Deploying Source: ee_clevent_presentation -> ee_clevent_presentationRaw -> cleventPresentation -> Flat Map (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,256 - Deploying Source: k2-logtracevpc -> k2-logtracevpcRaw -> (Filter -> cybertronLogReplicated, Filter -> cybertronLogReplicatedCrossRegion, gpsRequestPivotAudit, gpsRequestPivotAuditCrossRegion) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 - Deploying Source: k2-ee_clevent -> k2-ee_cleventRaw -> (cleventAddToPlaylistCommand -> Flat Map, cleventThumbRating -> Flat Map) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 - Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw -> (PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 - Deploying Filter -> Process -> (Sink: cybertron_joiner_client_input_0, Sink: cybertron_joiner_client_input_1) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 - Deploying Filter -> Process -> (Sink: cybertron_joiner_server_input_0, Sink: cybertron_joiner_server_input_1) (510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:43,872 - Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered.
2020-09-18 23:57:43,940 - Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60068 ms.
2020-09-18 23:58:44,008 - Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered.
2020-09-18 23:59:11,794 - Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw -> (PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) (488/640) (attempt #3) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:59:27,101 - Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw -> (PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) (478/640) (attempt #2) to 1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:59:44,075 - Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60066 ms.
2020-09-19 00:00:44,144 - Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered.
2020-09-19 00:01:44,634 - Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60489 ms.
2020-09-19 00:02:44,710 - Adding Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not registered.
2020-09-19 00:03:44,799 - Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is being removed from TM termination pool after 60088 ms.
2020-09-19 00:04:44,871 - Adding Titus task
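The per-minute bucketing done with grep/cut/sort/uniq above generalizes easily. A minimal Python sketch of the same technique, with hypothetical log lines for illustration (unlike the `grep -B1` above, this assumes the timestamp and the matched text sit on the same line):

```python
import re
from collections import Counter

def count_by_minute(lines, pattern):
    """Count log lines matching `pattern`, bucketed by minute.

    Mirrors `grep <pattern> *.log | grep '^2020' | cut -c-16 | sort | uniq -c`:
    the first 16 characters of a '2020-09-18 23:57:43,940'-style
    timestamp identify the minute bucket.
    """
    rx = re.compile(pattern)
    return Counter(line[:16] for line in lines
                   if line.startswith("20") and rx.search(line))

# Hypothetical sample lines in the shape of the JM logs quoted above:
logs = [
    "2020-09-18 23:57:01,000 Heartbeat of TaskManager with id A timed out.",
    "2020-09-18 23:57:05,000 Heartbeat of TaskManager with id B timed out.",
    "2020-09-19 01:06:02,000 Heartbeat of TaskManager with id C timed out.",
]
print(count_by_minute(logs, r"Heartbeat of Task"))
# Counter({'2020-09-18 23:57': 2, '2020-09-19 01:06': 1})
```

Bursts in one bucket (like the 509 JM-heartbeat expirations in a single minute here) are a quick signal that one event hit many parallel subtasks at once.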