Hi Kanchi,

Could you provide with more information on it? Like at what stage this log 
prints (job recovering, running, etc), any more detailed job or stacktrace.

Best,
Zhanghao Chen
________________________________
From: Kanchi Masalia via user <user@flink.apache.org>
Sent: Friday, February 16, 2024 4:07
To: Neha Rawat <n.ra...@netwitness.com>
Cc: Kenan Kılıçtepe <kkilict...@gmail.com>; user@flink.apache.org 
<user@flink.apache.org>; Liang Mou <l...@pinterest.com>
Subject: Re: Task Manager getting killed while executing sql queries.

Hi!

We just encountered a similar issue.

This is usually caused by: 1) Akka failed sending the message silently, due to 
problems like oversized payload or serialization failures. In that case, you 
should find detailed error information in the logs. 2) The recipient needs more 
time for responding, due to problems like slow machines or network jitters. In 
that case, you can try to increase akka.ask.timeout.

Were you able to resolve the issue? Could you suggest what worked for you?


Thanks,
Kanchi Masalia



On Tue, Aug 29, 2023 at 6:18 AM Neha Rawat 
<n.ra...@netwitness.com<mailto:n.ra...@netwitness.com>> wrote:

Thanks for your response.

I did check and the memory utilization looks fine.  Attaching a VisualVM 
screenshot. Memory usage was well under the limits.

There are phases of low CPU consumption by Flink (below 10% with spikes that go 
upto 100%) and number of threads go down as well. That’s the time when I see 
Error#1. #2 and #3 as listed in the original email.



Thanks,

Neha



From: Kenan Kılıçtepe <kkilict...@gmail.com<mailto:kkilict...@gmail.com>>
Sent: Monday, August 28, 2023 4:25 PM
To: Neha Rawat <n.ra...@netwitness.com<mailto:n.ra...@netwitness.com>>
Cc: user@flink.apache.org<mailto:user@flink.apache.org>
Subject: Re: Task Manager getting killed while executing sql queries.



You don't often get email from 
kkilict...@gmail.com<mailto:kkilict...@gmail.com>. Learn why this is 
important<https://aka.ms/LearnAboutSenderIdentification>

CAUTION:External email. Do not click or open attachments unless you know and 
trust the sender.



Can it be a memory leak? Have you observed the memory consumption of task 
managers?

Once, task manager crush issue happened for me and it was OOM.





On Mon, Aug 28, 2023 at 9:12 PM Neha Rawat 
<n.ra...@netwitness.com<mailto:n.ra...@netwitness.com>> wrote:

Hi,



Need some help with the below situation. If would be great if someone could 
give some pointers on how to resolve this.



I am trying to execute 80 SQL queries on data coming in from source – Kafka 
Topic Event. The results are written to 2 sinks – Kafka topics – Alert & 
UnsortedAlert.

The input data is coming in at an EPS of about 230K Events/sec.



  *   Kafka was up and running all the time on the same host as flink.
  *   Job  configurations=>  --parallelism 20 --checkpoint-interval 60 
--aligned-checkpoint-timeout 5 --min-pause-checkpoints 5 --table-state-ttl 360
  *   Flink config -
state.backend.rocksdb.localdir: /var/netwitness/flink/rocksdb

state.backend.rocksdb.memory.write-buffer-ratio: 0.75

state.backend.rocksdb.log.level: WARN_LEVEL

state.backend.rocksdb.log.dir: /dev/null

state.backend.rocksdb.log.max-file-size: 1MB

state.backend.rocksdb.log.file-num: 1

state.backend.rocksdb.thread.num: 8

execution.checkpointing.timeout: 20 min



Have also tried the same test with flink config that has changelog enabled, 
have changed min-pause-checkpoints to 50 but the observation remains the same.



Observations-

  *   Checkpointing initially takes less than a second, but as the test 
progresses there are phases (3-4 consecutive checkpoints) where it takes more 
than a minutes and sometimes up to 9 minutes.
  *   Task manager gets killed after a couple of hours.
  *   These are the errors  in taskExecutor logs –



Error 1-

2023-08-22 22:53:59,441 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-16, groupId=Event] Disconnecting from 
node 1 due to request timeout.

2023-08-22 22:53:59,441 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-16, groupId=Event] Cancelled in-flight 
FETCH request with correlation id 445553 due to node 1 being disconnected 
(elapsed time since creation: 147696ms, elapsed time since send: 147696ms, 
request timeout: 30000ms)

2023-08-22 22:53:59,441 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-16, groupId=Event] Cancelled in-flight 
METADATA request with correlation id 445555 due to node 1 being disconnected 
(elapsed time since creation: 1ms, elapsed time since send: 1ms, request 
timeout: 30000ms)

2023-08-22 22:53:59,441 INFO  org.apache.kafka.clients.FetchSessionHandler      
           [] - [Consumer clientId=Event-16, groupId=Event] Error sending fetch 
request (sessionId=1726574973, epoch=367) to node 1:

org.apache.kafka.common.errors.DisconnectException: null

2023-08-22 22:54:06,975 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-19, groupId=Event] Disconnecting from 
node 1 due to request timeout.

2023-08-22 22:54:06,975 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-19, groupId=Event] Cancelled in-flight 
FETCH request with correlation id 446105 due to node 1 being disconnected 
(elapsed time since creation: 42100ms, elapsed time since send: 42100ms, 
request timeout: 30000ms)

2023-08-22 22:54:06,975 INFO  org.apache.kafka.clients.FetchSessionHandler      
           [] - [Consumer clientId=Event-19, groupId=Event] Error sending fetch 
request (sessionId=1951018275, epoch=1686) to node 1:

org.apache.kafka.common.errors.DisconnectException: null

2023-08-22 23:03:17,824 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-1, groupId=Event] Disconnecting from 
node 1 due to request timeout.

2023-08-22 23:03:17,825 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-1, groupId=Event] Cancelled in-flight 
FETCH request with correlation id 438228 due to node 1 being disconnected 
(elapsed time since creation: 61377ms, elapsed time since send: 61377ms, 
request timeout: 30000ms)

2023-08-22 23:03:17,826 INFO  org.apache.kafka.clients.FetchSessionHandler      
           [] - [Consumer clientId=Event-1, groupId=Event] Error sending fetch 
request (sessionId=869666967, epoch=5) to node 1:

org.apache.kafka.common.errors.DisconnectException: null

2023-08-22 23:05:06,655 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-3, groupId=Event] Disconnecting from 
node 1 due to request timeout.

2023-08-22 23:05:06,655 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-3, groupId=Event] Cancelled in-flight 
FETCH request with correlation id 439255 due to node 1 being disconnected 
(elapsed time since creation: 79499ms, elapsed time since send: 79499ms, 
request timeout: 30000ms)

2023-08-22 23:05:06,656 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-3, groupId=Event] Cancelled in-flight 
METADATA request with correlation id 439256 due to node 1 being disconnected 
(elapsed time since creation: 1ms, elapsed time since send: 1ms, request 
timeout: 30000ms)

2023-08-22 23:05:06,656 INFO  org.apache.kafka.clients.FetchSessionHandler      
           [] - [Consumer clientId=Event-3, groupId=Event] Error sending fetch 
request (sessionId=26073738, epoch=4) to node 1:

org.apache.kafka.common.errors.DisconnectException: null

2023-08-22 23:07:00,107 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-9, groupId=Event] Disconnecting from 
node 1 due to request timeout.

2023-08-22 23:07:00,107 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-9, groupId=Event] Cancelled in-flight 
FETCH request with correlation id 441161 due to node 1 being disconnected 
(elapsed time since creation: 44847ms, elapsed time since send: 44847ms, 
request timeout: 30000ms)

2023-08-22 23:07:00,108 INFO  org.apache.kafka.clients.FetchSessionHandler      
           [] - [Consumer clientId=Event-9, groupId=Event] Error sending fetch 
request (sessionId=2113325484, epoch=10) to node 1:

org.apache.kafka.common.errors.DisconnectException: null

2023-08-22 23:10:07,342 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-12, groupId=Event] Disconnecting from 
node 1 due to request timeout.

2023-08-22 23:10:07,343 INFO  org.apache.kafka.clients.NetworkClient            
           [] - [Consumer clientId=Event-12, groupId=Event] Cancelled in-flight 
FETCH request with correlation id 440678 due to node 1 being disconnected 
(elapsed time since creation: 51417ms, elapsed time since send: 51417ms, 
request timeout: 30000ms)



Error 2-

org.apache.kafka.common.errors.DisconnectException: null

2023-08-22 23:10:18,246 INFO  
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - 
GlobalWindowAggregate[25] -> (Calc[26] -> UnsortedAlert[27]: Writer -> 
UnsortedAlert[27]: Committer, Calc[390] -> UnsortedAlert[391]: Writer -> 
UnsortedAlert[391]: Committer) (6/20)#0 - asynchronous part of checkpoint 5450 
could not be completed.

java.util.concurrent.CancellationException: null

        at java.util.concurrent.FutureTask.report(FutureTask.java:121) ~[?:?]

        at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]

        at 
org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:57)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
 [flink-dist-1.17.1.jar:1.17.1]

        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]

        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]

        at java.lang.Thread.run(Thread.java:829) [?:?]

2023-08-22 23:10:18,245 INFO  
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - 
GlobalWindowAggregate[31] -> (Calc[32] -> UnsortedAlert[33]: Writer -> 
UnsortedAlert[33]: Committer, Calc[392] -> UnsortedAlert[393]: Writer -> 
UnsortedAlert[393]: Committer) (12/20)#0 - asynchronous part of checkpoint 5450 
could not be completed.

java.util.concurrent.CancellationException: null

        at java.util.concurrent.FutureTask.report(FutureTask.java:121) ~[?:?]

        at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]

        at 
org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:54)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
 [flink-dist-1.17.1.jar:1.17.1]

        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]

        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]

        at java.lang.Thread.run(Thread.java:829) [?:?]







Error 3-

2023-08-22 23:10:23,647 INFO  
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] - 
GlobalWindowAggregate[25] -> (Calc[26] -> UnsortedAlert[27]: Writer -> 
UnsortedAlert[27]: Committer, Calc[390] -> UnsortedAlert[391]: Writer -> 
UnsortedAlert[391]: Committer) (7/20)#0 - asynchronous part of checkpoint 5450 
could not be completed.

java.util.concurrent.ExecutionException: 
org.apache.flink.runtime.checkpoint.CheckpointException: The checkpoint was 
aborted due to exception of other subtasks sharing the ChannelState file.

        at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) 
~[?:?]

        at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]

        at 
org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:66)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
 [flink-dist-1.17.1.jar:1.17.1]

        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]

        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]

        at java.lang.Thread.run(Thread.java:829) [?:?]

Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: The 
checkpoint was aborted due to exception of other subtasks sharing the 
ChannelState file.

        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.fail(ChannelStateCheckpointWriter.java:298)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.failAndClearWriter(ChannelStateWriteRequestDispatcherImpl.java:212)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.handleCheckpointAbortRequest(ChannelStateWriteRequestDispatcherImpl.java:189)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatchInternal(ChannelStateWriteRequestDispatcherImpl.java:129)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatch(ChannelStateWriteRequestDispatcherImpl.java:94)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.loop(ChannelStateWriteRequestExecutorImpl.java:161)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.run(ChannelStateWriteRequestExecutorImpl.java:116)
 ~[flink-dist-1.17.1.jar:1.17.1]

        ... 1 more

Caused by: java.util.concurrent.CancellationException: checkpoint aborted via 
notification

        at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpoint(SubtaskCheckpointCoordinatorImpl.java:455)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpointAborted(SubtaskCheckpointCoordinatorImpl.java:409)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointAbortAsync$17(StreamTask.java:1387)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointOperation$19(StreamTask.java:1410)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) 
~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:398)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:367)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:352)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788) 
~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
 ~[flink-dist-1.17.1.jar:1.17.1]

        at 
org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931) 
~[flink-dist-1.17.1.jar:1.17.1]

        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745) 
~[flink-dist-1.17.1.jar:1.17.1]

        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562) 
~[flink-dist-1.17.1.jar:1.17.1]





Error 4 –

2023-08-23 10:00:52,852 WARN  org.apache.flink.runtime.taskmanager.Task         
           [] - Source: Event[10] -> (MiniBatchAssigner[11] -> (Calc[12] -> 
(LocalWindowAggregate[13], LocalWindowAggregate[381]), Calc[22] -> LocalW       
 indowAggregate[23], Calc[28] -> LocalWindowAggregate[29], Calc[34] -> 
(LocalWindowAggregate[35], LocalWindowAggregate[394]), Calc[40] -> 
LocalWindowAggregate[41], Calc[46] -> (LocalWindowAggregate[47], 
LocalWindowAggregate[401]),         Calc[52] -> LocalWindowAggregate[53], 
Calc[58] -> (WindowTableFunction[59] -> Calc[60] -> LocalWindowAggregate[61], 
WindowTableFunction[408] -> Calc[409] -> LocalWindowAggregate[410]), Calc[66] 
-> LocalWindowAggregate[67], Calc[        72] -> WindowTableFunction[73] -> 
Calc[74] -> LocalWindowAggregate[75], Calc[80] -> LocalWindowAggregate[81], 
Calc[86] -> LocalWindowAggregate[87], Calc[92] -> LocalWindowAggregate[93], 
Calc[98] -> (WindowTableFunction[99] -> Cal        c[100] -> 
LocalWindowAggregate[101], WindowTableFunction[425] -> Calc[426] -> 
LocalWindowAggregate[427]), Calc[106] -> (WindowTableFunction[107] -> Calc[108] 
-> LocalWindowAggregate[109], WindowTableFunction[432] -> Calc[433] ->         
LocalWindowAggregate[434]), Calc[114] -> (LocalWindowAggregate[115], 
LocalWindowAggregate[439]), Calc[120] -> LocalWindowAggregate[121], Calc[126] 
-> LocalWindowAggregate[127], Calc[132] -> LocalWindowAggregate[133], Calc[138] 
->         LocalWindowAggregate[139], Calc[144] -> (LocalWindowAggregate[145], 
LocalWindowAggregate[452]), Calc[150] -> LocalWindowAggregate[151], Calc[156] 
-> LocalWindowAggregate[157], Calc[162] -> LocalWindowAggregate[163], Calc[168] 
->         LocalWindowAggregate[169], Calc[174] -> LocalWindowAggregate[175], 
Calc[180] -> LocalWindowAggregate[181], Calc[186] -> LocalWindowAggregate[187], 
Calc[192] -> LocalWindowAggregate[193], Calc[198] -> LocalWindowAggregate[199], 
C        alc[204] -> (LocalWindowAggregate[205], LocalWindowAggregate[475]), 
Calc[210] -> (LocalWindowAggregate[211], LocalWindowAggregate[480]), Calc[216] 
-> LocalWindowAggregate[217], Calc[222] -> LocalWindowAggregate[223], Calc[228] 
->         LocalWindowAggregate[229], Calc[234] -> WindowTableFunction[235] -> 
Calc[236] -> LocalWindowAggregate[237], Calc[242] -> LocalWindowAggregate[243], 
Calc[248] -> LocalWindowAggregate[249], Calc[254] -> LocalWindowAggregate[255], 
        Calc[260] -> LocalWindowAggregate[261], Calc[266] -> 
LocalWindowAggregate[267], Calc[272] -> LocalWindowAggregate[273], Calc[278] -> 
WindowTableFunction[279] -> Calc[280] -> LocalWindowAggregate[281], Calc[290] 
-> LocalWindowAggr        egate[291], Calc[296] -> LocalWindowAggregate[297], 
Calc[302] -> LocalWindowAggregate[303], Calc[308] -> LocalWindowAggregate[309], 
Calc[314] -> LocalWindowAggregate[315], Calc[320] -> LocalWindowAggregate[321], 
Calc[326] -> Wind        owTableFunction[327] -> Calc[328] -> 
LocalWindowAggregate[329], Calc[338] -> LocalWindowAggregate[339], Calc[348] -> 
LocalWindowAggregate[349], Calc[354] -> LocalWindowAggregate[355], Calc[360] -> 
LocalWindowAggregate[361]), Mini        BatchAssigner[366] -> (Calc[367] -> 
Alert[368]: Writer -> Alert[368]: Committer, Calc[369] -> Alert[370]: Writer -> 
Alert[370]: Committer, Calc[371] -> Alert[372]: Writer -> Alert[372]: 
Committer, Calc[373] -> Alert[374]: Writer         -> Alert[374]: Committer, 
Calc[375] -> Alert[376]: Writer -> Alert[376]: Committer, Calc[377] -> 
Alert[378]: Writer -> Alert[378]: Committer, Calc[379] -> Alert[380]: Writer -> 
Alert[380]: Committer)) (1/20)#0 (335c0e92a4914fc728        
9f6e977222b76b_e3dfc0d7e9ecd8a43f85f0b68ebf3b80_0_0) switched from RUNNING to 
FAILED with failure cause:

  97274 java.util.concurrent.TimeoutException: Invocation of 
[RemoteRpcInvocation(JobMasterOperatorEventGateway.sendOperatorEventToCoordinator(ExecutionAttemptID,
 OperatorID, SerializedValue))] at recipient [akka.tcp://flink@localhost:61     
   23/user/rpc/jobmanager_3] timed out. This is usually caused by: 1) Akka 
failed sending the message silently, due to problems like oversized payload or 
serialization failures. In that case, you should find detailed error informati  
      on in the logs. 2) The recipient needs more time for responding, due to 
problems like slow machines or network jitters. In that case, you can try to 
increase akka.ask.timeout.

  97275         at 
com.sun.proxy.$Proxy25.sendOperatorEventToCoordinator(Unknown Source) ~[?:?]

  97276         at 
org.apache.flink.runtime.taskexecutor.rpc.RpcTaskOperatorEventGateway.sendOperatorEventToCoordinator(RpcTaskOperatorEventGateway.java:60)
 ~[flink-dist-1.17.1.jar:1.17.1]

  97277         at 
org.apache.flink.streaming.runtime.tasks.OperatorEventDispatcherImpl$OperatorEventGatewayImpl.sendEventToCoordinator(OperatorEventDispatcherImpl.java:120)
 ~[flink-dist-1.17.1.jar:1.17.1]

  97278         at 
org.apache.flink.streaming.runtime.tasks.OperatorChain.lambda$sendAcknowledgeCheckpointEvent$1(OperatorChain.java:947)
 ~[flink-dist-1.17.1.jar:1.17.1]

  97279         at java.util.HashMap$KeySet.forEach(HashMap.java:929) ~[?:?]

  97280         at 
org.apache.flink.streaming.runtime.tasks.OperatorChain.sendAcknowledgeCheckpointEvent(OperatorChain.java:943)
 ~[flink-dist-1.17.1.jar:1.17.1]

  97281         at 
org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.snapshotState(RegularOperatorChain.java:201)
 ~[flink-dist-1.17.1.jar:1.17.1]

  97282         at 
org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:715)
 ~[flink-dist-1.17.1.jar:1.17.1]





Thanks,

Neha Rawat

Caution: External email. Do not click or open attachments unless you know and 
trust the sender.

Caution: External email. Do not click or open attachments unless you know and 
trust the sender.

Reply via email to