Hi Doug,
thanks for reaching out to the community. First of all, 1.9.2 is quite an
old Flink version. You might want to consider upgrading to a newer version.
The community only offers support for the two most-recent Flink versions.
Newer version might include fixes for your issue.

But back to your actual problem: The logs you're providing only show that
some job switched into FINISHED state. Is there some error showing up
earlier in the logs which you might have missed? It would be helpful if you
could share the complete JobManager logs to get a better understanding of
what's going on.

Best,
Matthias

On Wed, Sep 29, 2021 at 3:47 PM Gusick, Doug S <doug.gus...@gs.com> wrote:

> Hello,
>
>
>
> We are facing an issue with some of our applications that are submitting a
> high volume of jobs to Flink (we are using v1.9.2). We are observing that
> numerous jobs (in this case 44 out of 350+) fail with the same
> FlinkJobNotFoundException within a 45 second timeframe.
>
>
>
> From our client logs, this is the exception we can see:
>
>
>
> Calc Engine: Caused by: 
> org.apache.flink.runtime.rest.util.RestClientException: 
> [org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
> Flink job (d0991f0ae712a9df710aa03311a32c8c)]
>
> Calc Engine:   at 
> org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)
>
> Calc Engine:   at 
> org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)
>
> Calc Engine:   at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
>
> Calc Engine:   at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
>
> Calc Engine:   at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>
> Calc Engine:   ... 3 more
>
>
>
>
>
> This is the first job to fail with the above exception. From the
> JobManager logs, we can see that the job goes to FINISHED State, and then
> we see the following exception:
>
>
>
> 2021-09-28 04:54:16,936 INFO  [flink-akka.actor.default-dispatcher-28]
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink
> Java Job at Tue Sep 28 04:48:21 EDT 2021 (d0991f0ae712a9df710aa03311a32c8c)
> switched from state RUNNING to FINISHED.
>
> 2021-09-28 04:54:16,937 INFO  [flink-akka.actor.default-dispatcher-28]
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job
> d0991f0ae712a9df710aa03311a32c8c reached globally terminal state FINISHED.
>
> 2021-09-28 04:54:16,939 INFO  [flink-akka.actor.default-dispatcher-28]
> org.apache.flink.runtime.jobmaster.JobMaster                  - Stopping
> the JobMaster for job Flink Java Job at Tue Sep 28 04:48:21 EDT
> 2021(d0991f0ae712a9df710aa03311a32c8c).
>
> 2021-09-28 04:54:16,940 INFO  [flink-akka.actor.default-dispatcher-39]
> org.apache.flink.yarn.YarnResourceManager                     - Disconnect
> job manager
> 00000000000000000000000000000...@akka.tcp://fl...@d43723-714.dc.gs.com:44887/user/jobmanager_392
> for job d0991f0ae712a9df710aa03311a32c8c from the resource manager.
>
> 2021-09-28 04:54:18,256 ERROR [flink-akka.actor.default-dispatcher-91]
> org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler  -
> Exception occurred in REST handler:
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find
> Flink job (d0991f0ae712a9df710aa03311a32c8c)
>
>
>
> Here are the relevant logs from the TaskManager. We can see that the 
> JobLeaderService tries to reconnect to the job. Any ideas as to why it is 
> trying to reconnect?:
>
>
> 2021-09-28 04:54:13,382 INFO  [flink-akka.actor.default-dispatcher-16] 
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Receive slot 
> request b26c04706fd5aad03dfdca8691f1bf1c for job 
> d0991f0ae712a9df710aa03311a32c8c from resource manager with leader id 
> 00000000000000000000000000000000.
>
> 2021-09-28 04:54:13,383 INFO  [flink-akka.actor.default-dispatcher-16] 
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Add job 
> d0991f0ae712a9df710aa03311a32c8c for job leader monitoring.
>
> 2021-09-28 04:54:13,397 INFO  [flink-akka.actor.default-dispatcher-16] 
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Successful 
> registration at job manager 
> akka.tcp://fl...@d43723-714.dc.gs.com:44887/user/jobmanager_392 for job 
> d0991f0ae712a9df710aa03311a32c8c.
>
> 2021-09-28 04:54:13,397 INFO  [flink-akka.actor.default-dispatcher-16] 
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Establish 
> JobManager connection for job d0991f0ae712a9df710aa03311a32c8c.
>
> 2021-09-28 04:54:13,397 INFO  [flink-akka.actor.default-dispatcher-16] 
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Offer 
> reserved slots to the leader of job d0991f0ae712a9df710aa03311a32c8c.
>
> 2021-09-28 04:54:13,405 INFO  [CHAIN DataSource (settl_delivery_type_code | 
> DistCp | Sourcing Files) -> FlatMap (settl_delivery_type_code | DistCp | 
> Copying Batch Data) (1/1)] org.apache.flink.runtime.blob.BlobClient           
>            - Downloading 
> d0991f0ae712a9df710aa03311a32c8c/p-54dfdc41d9a995e5b75eb9eb29bcac91725fc425-be2bc5df9ef6ce331e8019cf32eb222b
>  from d43723-714.dc.gs.com/10.175.239.171:38726
>
> 2021-09-28 04:54:16,942 INFO  [flink-akka.actor.default-dispatcher-3] 
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable      - Free slot 
> TaskSlot(index:0, state:ACTIVE, resource profile: 
> ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, 
> directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, 
> networkMemoryInMB=2147483647, managedMemoryInMB=9318}, allocationId: 
> b26c04706fd5aad03dfdca8691f1bf1c, jobId: d0991f0ae712a9df710aa03311a32c8c).
>
> 2021-09-28 04:54:16,942 INFO  [flink-akka.actor.default-dispatcher-3] 
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Remove job 
> d0991f0ae712a9df710aa03311a32c8c from job leader monitoring.
>
> 2021-09-28 04:54:16,942 INFO  [flink-akka.actor.default-dispatcher-3] 
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Close 
> JobManager connection for job d0991f0ae712a9df710aa03311a32c8c.
>
> 2021-09-28 04:54:16,945 INFO  [flink-akka.actor.default-dispatcher-3] 
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Close 
> JobManager connection for job d0991f0ae712a9df710aa03311a32c8c.
>
> 2021-09-28 04:54:16,945 INFO  [flink-akka.actor.default-dispatcher-3] 
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Cannot 
> reconnect to job d0991f0ae712a9df710aa03311a32c8c because it is not 
> registered.
>
>
>
>
>
> We have seen that increasing driver memory helps alleviate the issue, but
> for certain high volume applications, we still see the same exception, even
> with 50G driver memory. We would like to avoid increasing driver memory any
> further, if at all possible. We are unsure of the root cause, and any input
> would be greatly appreciated.
>
>
>
> Please let me know if any more information is required.
>
>
>
>
>
> Best,
>
> Doug
>
>
>
> *_______________________________________________________________*
>
> *Doug Gusick*
>
> Data Architecture
>
> Goldman Sachs
>
> 30 Hudson St| Jersey City, 07302
>
> Tel: +1 212 357 2640
>
> email: doug.gus...@gs.com
>
> *_____________________________________________________________*
>
> *The Goldman Sachs Group, Inc. All rights reserved*.
>
> See http://www.gs.com/disclaimer/global_email for important risk
> disclosures, conflicts of interest and other terms and conditions relating
> to this e-mail and your reliance on information contained in it.  This
> message may contain confidential or privileged information.  If you are not
> the intended recipient, please advise us immediately and delete this
> message.  See http://www.gs.com/disclaimer/email for further information
> on confidentiality and the risks of non-secure electronic communication.
> If you cannot access these links, please notify us by reply message and we
> will send the contents to you.
>
>
>
> ------------------------------
>
> Your Personal Data: We may collect and process information about you that
> may be subject to data protection laws. For more information about how we
> use and disclose your personal data, how we protect your information, our
> legal basis to use your information, your rights and who you can contact,
> please refer to: www.gs.com/privacy-notices
>

Reply via email to