[jira] [Comment Edited] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-10-05 Thread Steven Zhen Wu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208437#comment-17208437
 ] 

Steven Zhen Wu edited comment on FLINK-19401 at 10/6/20, 12:42 AM:
---

[~roman_khachatryan] I don't know whether repeated checkpoint recovery is the 
root/main cause or not. [~trohrmann] identified two problems during his 
investigation; this is one of the identified problems.
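
For illustration only (a minimal, hypothetical sketch in plain Java, not Flink's 
actual JobMaster code): if long-running recovery work runs on the same 
single-threaded main executor that also processes heartbeats, every extra 
recovery directly delays heartbeat handling, which is the blocking behaviour the 
issue title describes.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one single-threaded executor handling both "recovery"
// work and heartbeats. While a long recovery occupies the thread, the heartbeat
// is only handled afterwards -- possibly past its timeout.
public class MainThreadStarvationSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();

        // Simulates an expensive checkpoint recovery submitted on every restart.
        mainThread.execute(() -> sleep(60_000));

        // A heartbeat arriving during recovery waits until the recovery is done.
        long received = System.currentTimeMillis();
        mainThread.execute(() ->
            System.out.println("heartbeat handled after "
                + (System.currentTimeMillis() - received) + " ms"));

        mainThread.shutdown();
        mainThread.awaitTermination(2, TimeUnit.MINUTES);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
{code}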

Regarding the logs related to Titus, please ignore them. They are just noise. 
We haven't cleaned up our logs yet.



> Job stuck in restart loop due to excessive checkpoint recoveries which block 
> the JobMaster
> --
>
> Key: FLINK-19401
> URL: https://issues.apache.org/jira/browse/FLINK-19401
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Checkpointing
>Affects Versions: 1.10.1, 1.11.2
>Reporter: Steven Zhen Wu
>Priority: Critical
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> The Flink job sometimes gets into a restart loop for many hours and can't 
> recover until it is redeployed. We had an issue with Kafka that initially 
> caused the job to restart. 
> Below is the first of the many exceptions for the "ResourceManagerException: 
> Could not find registered job manager" error.
> {code}
> 2020-09-19 00:03:31,614 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{171f1df017dab3a42c032abd07908b9b}] and profile ResourceP
> rofile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{cc7d136c4ce1f32285edd4928e3ab2e2}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{024c8a48dafaf8f07c49dd4320d5cc94}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,615 INFO  
> org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl 
> [flink-akka.actor.default-dispatcher-35973]  - Requesting new slot 
> [SlotRequestId{a591eda805b3081ad2767f5641d0db06}] and profile 
> ResourceProfile{UNKNOWN} from resource manager.
> 2020-09-19 00:03:31,620 INFO  
> org.apache.flink.runtime.executiongraph.ExecutionGraph   
> [flink-akka.actor.default-dispatcher-35973]  - Source: k2-csevpc -> 
> k2-csevpcRaw -> (vhsPlaybackEvents -> Flat Map, merchImpressionsClientLog -> 
> Flat Map) (56/640) (1b0d3dd1f19890886ff373a3f08809e8) switched from SCHEDULED 
> to FAILED.
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> No pooled slot available and request to ResourceManager for new slot failed
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:433)
> at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forward$21(FutureUtils.java:1065)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at 
> java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:792)
> at 
> java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2153)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils.forward(FutureUtils.java:1063)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager.createRootSlot(SlotSharingManager.java:155)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateMultiTaskSlot(SchedulerImpl.java:511)
> at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.allocateSharedSlot(SchedulerImpl.java:311)
> at 
> 

[jira] [Comment Edited] (FLINK-19401) Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

2020-10-05 Thread Roman Khachatryan (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-19401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17208335#comment-17208335
 ] 

Roman Khachatryan edited comment on FLINK-19401 at 10/5/20, 9:42 PM:
-

I do see that the same data is downloaded over and over, but I'm not sure 
whether it's the root cause.
From the logs I see that heartbeats *on JM* from TMs expire first, and 
heartbeats *on TMs* from JM expire two minutes later:
{code:java}
flink-19401 $ grep -ah -B1 'Heartbeat of Task' *.log | grep '^2020' | cut -c-16 
| sort | uniq -c
 66 2020-09-18 23:57
 11 2020-09-19 01:06
flink-19401 $ grep -ah -B1 'heartbeat of Job' *.log | grep '^2020' | cut -c-16 
| sort | uniq -c
509 2020-09-18 23:59
  2 2020-09-19 00:03
  2 2020-09-19 00:32
  2 2020-09-19 00:37
  2 2020-09-19 01:06
{code}
(The 77 vs 517 difference is due to the parallelism level, I think.)

 

Furthermore, the first 8 recoveries start after failures in Kafka: an NPE in 
NetworkClient and "SSLProtocolException: Handshake message sequence violation, 
2".

After that, the recoveries happen after Kafka failures and the aforementioned 
TM heartbeat timeouts on the JM.

 

So it looks like the root cause is misconfiguration and/or a network issue.
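
If misconfiguration is a factor, the JM<->TM heartbeat settings would be the 
first thing to compare against the gaps above. A minimal sketch of where they 
live, assuming the defaults (heartbeat.interval = 10 s, heartbeat.timeout = 
50 s); the actual job may override them in flink-conf.yaml:
{code:java}
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HeartbeatManagerOptions;

// Sketch only: print the JM<->TM heartbeat options (defaults shown here,
// assuming nothing is overridden in flink-conf.yaml).
public class HeartbeatConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        System.out.println("heartbeat.interval = "
            + conf.get(HeartbeatManagerOptions.HEARTBEAT_INTERVAL) + " ms");
        System.out.println("heartbeat.timeout  = "
            + conf.get(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT) + " ms");

        // Raising the timeout would only mask slow heartbeat processing; it
        // does not address why heartbeats are delayed in the first place.
        conf.set(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT, 120_000L);
    }
}
{code}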

WDYT [~trohrmann], [~stevenz3wu] ?

 

Steven, can you please also explain what Titus is doing here?
(It looks like it is adding and then removing the task every minute - does 
that ring any bell?)
{code:java}
$ grep 72cc604b-dcd5-4c75-a466-0321d7c51c3e *.log

2020-09-18 23:55:43,804 - Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is 
being removed from TM termination pool after 60067 ms.
2020-09-18 23:56:11,256 - Deploying Source: ee_clevent_presentation -> 
ee_clevent_presentationRaw -> cleventPresentation -> Flat Map (510/640) 
(attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,256 - Deploying Source: k2-logtracevpc -> k2-logtracevpcRaw 
-> (Filter -> cybertronLogReplicated, Filter -> 
cybertronLogReplicatedCrossRegion, gpsRequestPivotAudit, 
gpsRequestPivotAuditCrossRegion) (510/640) (attempt #1) to 
1273fbc6edbab45548807742b2db6c4e @ 72cc604b-dcd5-4c75-a466-0321d7c51c3e 
(dataPort=39833)
2020-09-18 23:56:11,261 - Deploying Source: k2-ee_clevent -> k2-ee_cleventRaw 
-> (cleventAddToPlaylistCommand -> Flat Map, cleventThumbRating -> Flat Map) 
(510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 - Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw 
-> (PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) 
(510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 - Deploying Filter -> Process -> (Sink: 
cybertron_joiner_client_input_0, Sink: cybertron_joiner_client_input_1) 
(510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:11,261 - Deploying Filter -> Process -> (Sink: 
cybertron_joiner_server_input_0, Sink: cybertron_joiner_server_input_1) 
(510/640) (attempt #1) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:56:43,872 - Adding Titus task 
72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not 
registered.
2020-09-18 23:57:43,940 - Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is 
being removed from TM termination pool after 60068 ms.
2020-09-18 23:58:44,008 - Adding Titus task 
72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not 
registered.
2020-09-18 23:59:11,794 - Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw 
-> (PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) 
(488/640) (attempt #3) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:59:27,101 - Deploying Source: k2-defaultvpc -> k2-defaultvpcRaw 
-> (PssiPlaybackEvents -> Flat Map, Filter -> cybertronLogUnreplicated) 
(478/640) (attempt #2) to 1273fbc6edbab45548807742b2db6c4e @ 
72cc604b-dcd5-4c75-a466-0321d7c51c3e (dataPort=39833)
2020-09-18 23:59:44,075 - Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is 
being removed from TM termination pool after 60066 ms.
2020-09-19 00:00:44,144 - Adding Titus task 
72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not 
registered.
2020-09-19 00:01:44,634 - Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is 
being removed from TM termination pool after 60489 ms.
2020-09-19 00:02:44,710 - Adding Titus task 
72cc604b-dcd5-4c75-a466-0321d7c51c3e to TM termination pool because TM is not 
registered.
2020-09-19 00:03:44,799 - Titus task 72cc604b-dcd5-4c75-a466-0321d7c51c3e is 
being removed from TM termination pool after 60088 ms.
2020-09-19 00:04:44,871 - Adding Titus task 
