Hi!

We just encountered a similar issue.

This is usually caused by: 1) Akka failed sending the message silently, due
to problems like oversized payload or serialization failures. In that case,
you should find detailed error information in the logs. 2) The recipient
needs more time for responding, due to problems like slow machines or
network jitters. In that case, you can try to increase akka.ask.timeout.
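
For reference, a minimal sketch of what raising that timeout could look like
in flink-conf.yaml, assuming Flink 1.17.x as in the stack traces below. The
60 s value is only an illustrative assumption, not a tested or recommended
setting, and it would only help with cause 2) above, not with cause 1):

```yaml
# flink-conf.yaml (sketch; the value is illustrative only)
# Timeout for RPC "ask" calls such as the sendOperatorEventToCoordinator
# invocation that timed out in the quoted error.
akka.ask.timeout: 60 s
```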

Were you able to resolve the issue? Could you suggest what worked for you?


Thanks,
Kanchi Masalia



On Tue, Aug 29, 2023 at 6:18 AM Neha Rawat <n.ra...@netwitness.com> wrote:

> Thanks for your response.
>
> I did check and the memory utilization looks fine.  Attaching a VisualVM
> screenshot. Memory usage was well under the limits.
>
> There are phases of low CPU consumption by Flink (below 10%, with spikes
> that go up to 100%), and the number of threads goes down as well. That’s the
> time when I see Errors #1, #2 and #3 as listed in the original email.
>
>
>
> Thanks,
>
> Neha
>
>
>
> *From:* Kenan Kılıçtepe <kkilict...@gmail.com>
> *Sent:* Monday, August 28, 2023 4:25 PM
> *To:* Neha Rawat <n.ra...@netwitness.com>
> *Cc:* user@flink.apache.org
> *Subject:* Re: Task Manager getting killed while executing sql queries.
>
>
>
>
>
> Can it be a memory leak? Have you observed the memory consumption of
> task managers?
>
> I once had a task manager crash, and the cause turned out to be an OOM.
>
>
>
>
>
> On Mon, Aug 28, 2023 at 9:12 PM Neha Rawat <n.ra...@netwitness.com> wrote:
>
> Hi,
>
>
>
> Need some help with the below situation. It would be great if someone
> could give some pointers on how to resolve this.
>
>
>
> I am trying to execute 80 SQL queries on data coming in from source –
> Kafka Topic Event. The results are written to 2 sinks – Kafka topics –
> Alert & UnsortedAlert.
>
> The input data is coming in at an EPS of about 230K Events/sec.
>
>
>
>    - Kafka was up and running all the time on the same host as flink.
>    - *Job configurations:* --parallelism 20 --checkpoint-interval 60
>    --aligned-checkpoint-timeout 5 --min-pause-checkpoints 5 --table-state-ttl
>    360
>    - *Flink config:*
>
> state.backend.rocksdb.localdir: /var/netwitness/flink/rocksdb
>
> state.backend.rocksdb.memory.write-buffer-ratio: 0.75
>
> state.backend.rocksdb.log.level: WARN_LEVEL
>
> state.backend.rocksdb.log.dir: /dev/null
>
> state.backend.rocksdb.log.max-file-size: 1MB
>
> state.backend.rocksdb.log.file-num: 1
>
> state.backend.rocksdb.thread.num: 8
>
> execution.checkpointing.timeout: 20 min
>
>
>
> I have also tried the same test with a Flink config that has changelog
> enabled, and have changed min-pause-checkpoints to 50, but the observations
> remain the same.
>
>
>
> *Observations*-
>
>    - Checkpointing initially takes less than a second, but as the test
>    progresses there are phases (3-4 consecutive checkpoints) where it takes
>    more than a minute and sometimes up to 9 minutes.
>    - The task manager gets killed after a couple of hours.
>    - These are the errors in the taskExecutor logs –
>
>
>
> Error 1-
>
> 2023-08-22 22:53:59,441 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-16, groupId=Event] Disconnecting from node 1 due to request
> timeout.
>
> 2023-08-22 22:53:59,441 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-16, groupId=Event] Cancelled in-flight FETCH request with
> correlation id 445553 due to node 1 being disconnected (elapsed time since
> creation: 147696ms, elapsed time since send: 147696ms, request timeout:
> 30000ms)
>
> 2023-08-22 22:53:59,441 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-16, groupId=Event] Cancelled in-flight METADATA request with
> correlation id 445555 due to node 1 being disconnected (elapsed time since
> creation: 1ms, elapsed time since send: 1ms, request timeout: 30000ms)
>
> 2023-08-22 22:53:59,441 INFO
> org.apache.kafka.clients.FetchSessionHandler                 [] - [Consumer
> clientId=Event-16, groupId=Event] Error sending fetch request
> (sessionId=1726574973, epoch=367) to node 1:
>
> org.apache.kafka.common.errors.DisconnectException: null
>
> 2023-08-22 22:54:06,975 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-19, groupId=Event] Disconnecting from node 1 due to request
> timeout.
>
> 2023-08-22 22:54:06,975 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-19, groupId=Event] Cancelled in-flight FETCH request with
> correlation id 446105 due to node 1 being disconnected (elapsed time since
> creation: 42100ms, elapsed time since send: 42100ms, request timeout:
> 30000ms)
>
> 2023-08-22 22:54:06,975 INFO
> org.apache.kafka.clients.FetchSessionHandler                 [] - [Consumer
> clientId=Event-19, groupId=Event] Error sending fetch request
> (sessionId=1951018275, epoch=1686) to node 1:
>
> org.apache.kafka.common.errors.DisconnectException: null
>
> 2023-08-22 23:03:17,824 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-1, groupId=Event] Disconnecting from node 1 due to request
> timeout.
>
> 2023-08-22 23:03:17,825 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-1, groupId=Event] Cancelled in-flight FETCH request with
> correlation id 438228 due to node 1 being disconnected (elapsed time since
> creation: 61377ms, elapsed time since send: 61377ms, request timeout:
> 30000ms)
>
> 2023-08-22 23:03:17,826 INFO
> org.apache.kafka.clients.FetchSessionHandler                 [] - [Consumer
> clientId=Event-1, groupId=Event] Error sending fetch request
> (sessionId=869666967, epoch=5) to node 1:
>
> org.apache.kafka.common.errors.DisconnectException: null
>
> 2023-08-22 23:05:06,655 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-3, groupId=Event] Disconnecting from node 1 due to request
> timeout.
>
> 2023-08-22 23:05:06,655 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-3, groupId=Event] Cancelled in-flight FETCH request with
> correlation id 439255 due to node 1 being disconnected (elapsed time since
> creation: 79499ms, elapsed time since send: 79499ms, request timeout:
> 30000ms)
>
> 2023-08-22 23:05:06,656 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-3, groupId=Event] Cancelled in-flight METADATA request with
> correlation id 439256 due to node 1 being disconnected (elapsed time since
> creation: 1ms, elapsed time since send: 1ms, request timeout: 30000ms)
>
> 2023-08-22 23:05:06,656 INFO
> org.apache.kafka.clients.FetchSessionHandler                 [] - [Consumer
> clientId=Event-3, groupId=Event] Error sending fetch request
> (sessionId=26073738, epoch=4) to node 1:
>
> org.apache.kafka.common.errors.DisconnectException: null
>
> 2023-08-22 23:07:00,107 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-9, groupId=Event] Disconnecting from node 1 due to request
> timeout.
>
> 2023-08-22 23:07:00,107 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-9, groupId=Event] Cancelled in-flight FETCH request with
> correlation id 441161 due to node 1 being disconnected (elapsed time since
> creation: 44847ms, elapsed time since send: 44847ms, request timeout:
> 30000ms)
>
> 2023-08-22 23:07:00,108 INFO
> org.apache.kafka.clients.FetchSessionHandler                 [] - [Consumer
> clientId=Event-9, groupId=Event] Error sending fetch request
> (sessionId=2113325484, epoch=10) to node 1:
>
> org.apache.kafka.common.errors.DisconnectException: null
>
> 2023-08-22 23:10:07,342 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-12, groupId=Event] Disconnecting from node 1 due to request
> timeout.
>
> 2023-08-22 23:10:07,343 INFO
> org.apache.kafka.clients.NetworkClient                       [] - [Consumer
> clientId=Event-12, groupId=Event] Cancelled in-flight FETCH request with
> correlation id 440678 due to node 1 being disconnected (elapsed time since
> creation: 51417ms, elapsed time since send: 51417ms, request timeout:
> 30000ms)
>
>
>
> Error 2-
>
> org.apache.kafka.common.errors.DisconnectException: null
>
> 2023-08-22 23:10:18,246 INFO
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] -
> GlobalWindowAggregate[25] -> (Calc[26] -> UnsortedAlert[27]: Writer ->
> UnsortedAlert[27]: Committer, Calc[390] -> UnsortedAlert[391]: Writer ->
> UnsortedAlert[391]: Committer) (6/20)#0 - asynchronous part of checkpoint
> 5450 could not be completed.
>
> java.util.concurrent.CancellationException: null
>
>         at java.util.concurrent.FutureTask.report(FutureTask.java:121)
> ~[?:?]
>
>         at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
>
>         at
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:57)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
> [flink-dist-1.17.1.jar:1.17.1]
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> [?:?]
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> [?:?]
>
>         at java.lang.Thread.run(Thread.java:829) [?:?]
>
> 2023-08-22 23:10:18,245 INFO
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] -
> GlobalWindowAggregate[31] -> (Calc[32] -> UnsortedAlert[33]: Writer ->
> UnsortedAlert[33]: Committer, Calc[392] -> UnsortedAlert[393]: Writer ->
> UnsortedAlert[393]: Committer) (12/20)#0 - asynchronous part of checkpoint
> 5450 could not be completed.
>
> java.util.concurrent.CancellationException: null
>
>         at java.util.concurrent.FutureTask.report(FutureTask.java:121)
> ~[?:?]
>
>         at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
>
>         at
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:544)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:54)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
> [flink-dist-1.17.1.jar:1.17.1]
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> [?:?]
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> [?:?]
>
>         at java.lang.Thread.run(Thread.java:829) [?:?]
>
>
>
>
>
>
>
> Error 3-
>
> 2023-08-22 23:10:23,647 INFO
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable [] -
> GlobalWindowAggregate[25] -> (Calc[26] -> UnsortedAlert[27]: Writer ->
> UnsortedAlert[27]: Committer, Calc[390] -> UnsortedAlert[391]: Writer ->
> UnsortedAlert[391]: Committer) (7/20)#0 - asynchronous part of checkpoint
> 5450 could not be completed.
>
> java.util.concurrent.ExecutionException:
> org.apache.flink.runtime.checkpoint.CheckpointException: The checkpoint was
> aborted due to exception of other subtasks sharing the ChannelState file.
>
>         at
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
> ~[?:?]
>
>         at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
> ~[?:?]
>
>         at
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:66)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:191)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
> [flink-dist-1.17.1.jar:1.17.1]
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> [?:?]
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> [?:?]
>
>         at java.lang.Thread.run(Thread.java:829) [?:?]
>
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: The
> checkpoint was aborted due to exception of other subtasks sharing the
> ChannelState file.
>
>         at
> org.apache.flink.runtime.checkpoint.channel.ChannelStateCheckpointWriter.fail(ChannelStateCheckpointWriter.java:298)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.failAndClearWriter(ChannelStateWriteRequestDispatcherImpl.java:212)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.handleCheckpointAbortRequest(ChannelStateWriteRequestDispatcherImpl.java:189)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatchInternal(ChannelStateWriteRequestDispatcherImpl.java:129)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestDispatcherImpl.dispatch(ChannelStateWriteRequestDispatcherImpl.java:94)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.loop(ChannelStateWriteRequestExecutorImpl.java:161)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.runtime.checkpoint.channel.ChannelStateWriteRequestExecutorImpl.run(ChannelStateWriteRequestExecutorImpl.java:116)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         ... 1 more
>
> Caused by: java.util.concurrent.CancellationException: checkpoint aborted
> via notification
>
>         at
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpoint(SubtaskCheckpointCoordinatorImpl.java:455)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.notifyCheckpointAborted(SubtaskCheckpointCoordinatorImpl.java:409)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointAbortAsync$17(StreamTask.java:1387)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$notifyCheckpointOperation$19(StreamTask.java:1410)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMail(MailboxProcessor.java:398)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:367)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:352)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:229)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:839)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:788)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:952)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:931)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:745)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at org.apache.flink.runtime.taskmanager.Task.run(Task.java:562)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>
>
>
>
> Error 4 –
>
> 2023-08-23 10:00:52,852 WARN
> org.apache.flink.runtime.taskmanager.Task                    [] - Source:
> Event[10] -> (MiniBatchAssigner[11] -> (Calc[12] ->
> (LocalWindowAggregate[13], LocalWindowAggregate[381]), Calc[22] ->
> LocalWindowAggregate[23], Calc[28] -> LocalWindowAggregate[29],
> Calc[34] -> (LocalWindowAggregate[35], LocalWindowAggregate[394]), Calc[40]
> -> LocalWindowAggregate[41], Calc[46] -> (LocalWindowAggregate[47],
> LocalWindowAggregate[401]), Calc[52] -> LocalWindowAggregate[53],
> Calc[58] -> (WindowTableFunction[59] -> Calc[60] ->
> LocalWindowAggregate[61], WindowTableFunction[408] -> Calc[409] ->
> LocalWindowAggregate[410]), Calc[66] -> LocalWindowAggregate[67],
> Calc[72] -> WindowTableFunction[73] -> Calc[74] ->
> LocalWindowAggregate[75], Calc[80] -> LocalWindowAggregate[81], Calc[86] ->
> LocalWindowAggregate[87], Calc[92] -> LocalWindowAggregate[93], Calc[98] ->
> (WindowTableFunction[99] -> Calc[100] -> LocalWindowAggregate[101],
> WindowTableFunction[425] -> Calc[426] -> LocalWindowAggregate[427]),
> Calc[106] -> (WindowTableFunction[107] -> Calc[108] ->
> LocalWindowAggregate[109], WindowTableFunction[432] -> Calc[433] ->
> LocalWindowAggregate[434]), Calc[114] -> (LocalWindowAggregate[115],
> LocalWindowAggregate[439]), Calc[120] -> LocalWindowAggregate[121],
> Calc[126] -> LocalWindowAggregate[127], Calc[132] ->
> LocalWindowAggregate[133], Calc[138] -> LocalWindowAggregate[139],
> Calc[144] -> (LocalWindowAggregate[145], LocalWindowAggregate[452]),
> Calc[150] -> LocalWindowAggregate[151], Calc[156] ->
> LocalWindowAggregate[157], Calc[162] -> LocalWindowAggregate[163],
> Calc[168] -> LocalWindowAggregate[169], Calc[174] ->
> LocalWindowAggregate[175], Calc[180] -> LocalWindowAggregate[181],
> Calc[186] -> LocalWindowAggregate[187], Calc[192] ->
> LocalWindowAggregate[193], Calc[198] -> LocalWindowAggregate[199],
> Calc[204] -> (LocalWindowAggregate[205], LocalWindowAggregate[475]),
> Calc[210] -> (LocalWindowAggregate[211], LocalWindowAggregate[480]),
> Calc[216] -> LocalWindowAggregate[217], Calc[222] ->
> LocalWindowAggregate[223], Calc[228] -> LocalWindowAggregate[229],
> Calc[234] -> WindowTableFunction[235] -> Calc[236] ->
> LocalWindowAggregate[237], Calc[242] -> LocalWindowAggregate[243],
> Calc[248] -> LocalWindowAggregate[249], Calc[254] ->
> LocalWindowAggregate[255], Calc[260] -> LocalWindowAggregate[261],
> Calc[266] -> LocalWindowAggregate[267], Calc[272] ->
> LocalWindowAggregate[273], Calc[278] -> WindowTableFunction[279] ->
> Calc[280] -> LocalWindowAggregate[281], Calc[290] ->
> LocalWindowAggregate[291], Calc[296] -> LocalWindowAggregate[297], Calc[302] ->
> LocalWindowAggregate[303], Calc[308] -> LocalWindowAggregate[309],
> Calc[314] -> LocalWindowAggregate[315], Calc[320] ->
> LocalWindowAggregate[321], Calc[326] -> WindowTableFunction[327] ->
> Calc[328] -> LocalWindowAggregate[329], Calc[338] ->
> LocalWindowAggregate[339], Calc[348] -> LocalWindowAggregate[349],
> Calc[354] -> LocalWindowAggregate[355], Calc[360] ->
> LocalWindowAggregate[361]), MiniBatchAssigner[366] -> (Calc[367] ->
> Alert[368]: Writer -> Alert[368]: Committer, Calc[369] -> Alert[370]:
> Writer -> Alert[370]: Committer, Calc[371] -> Alert[372]: Writer ->
> Alert[372]: Committer, Calc[373] -> Alert[374]: Writer ->
> Alert[374]: Committer, Calc[375] -> Alert[376]: Writer -> Alert[376]:
> Committer, Calc[377] -> Alert[378]: Writer -> Alert[378]: Committer,
> Calc[379] -> Alert[380]: Writer -> Alert[380]: Committer)) (1/20)#0
> (335c0e92a4914fc7289f6e977222b76b_e3dfc0d7e9ecd8a43f85f0b68ebf3b80_0_0)
> switched from RUNNING to FAILED with failure cause:
>
> java.util.concurrent.TimeoutException: Invocation of
> [RemoteRpcInvocation(JobMasterOperatorEventGateway.sendOperatorEventToCoordinator(ExecutionAttemptID,
> OperatorID, SerializedValue))] at recipient
> [akka.tcp://flink@localhost:6123/user/rpc/jobmanager_3] timed out. This is
> usually caused by: 1) Akka failed sending the message silently, due to
> problems like oversized payload or serialization failures. In that case,
> you should find detailed error information in the logs. 2) The recipient
> needs more time for responding, due to problems like slow machines or
> network jitters. In that case, you can try to increase akka.ask.timeout.
>
>         at
> com.sun.proxy.$Proxy25.sendOperatorEventToCoordinator(Unknown Source) ~[?:?]
>
>         at
> org.apache.flink.runtime.taskexecutor.rpc.RpcTaskOperatorEventGateway.sendOperatorEventToCoordinator(RpcTaskOperatorEventGateway.java:60)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.OperatorEventDispatcherImpl$OperatorEventGatewayImpl.sendEventToCoordinator(OperatorEventDispatcherImpl.java:120)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.OperatorChain.lambda$sendAcknowledgeCheckpointEvent$1(OperatorChain.java:947)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at java.util.HashMap$KeySet.forEach(HashMap.java:929)
> ~[?:?]
>
>         at
> org.apache.flink.streaming.runtime.tasks.OperatorChain.sendAcknowledgeCheckpointEvent(OperatorChain.java:943)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.snapshotState(RegularOperatorChain.java:201)
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>         at
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:715)
> ~[flink-dist-1.17.1.jar:1.17.1]
> ~[flink-dist-1.17.1.jar:1.17.1]
>
>
>
>
>
> Thanks,
>
> Neha Rawat
>
>
