I see - I should have checked my mailbox before answering. I received the email and was able to log in.
On Thu, Aug 26, 2021 at 6:12 PM Matthias Pohl <matth...@ververica.com> wrote:

> The link doesn't work, i.e. I'm redirected to a login page. It would also
> be good to include the Flink logs and make them accessible for everyone.
> This way others could share their perspective as well...
>
> On Thu, Aug 26, 2021 at 5:40 PM Shah, Siddharth [Engineering] <siddharth.x.s...@gs.com> wrote:
>
>> Hi Matthias,
>>
>> Thank you for responding and taking the time to look at the issue.
>>
>> Uploaded the YARN logs here:
>> https://lockbox.gs.com/lockbox/folders/963b0f29-85ad-4580-b420-8c66d9c07a84/
>> and have also requested read permissions for you. Please let us know if
>> you're not able to see the files.
>>
>> *From:* Matthias Pohl <matth...@ververica.com>
>> *Sent:* Thursday, August 26, 2021 9:47 AM
>> *To:* Shah, Siddharth [Engineering] <siddharth.x.s...@ny.email.gs.com>
>> *Cc:* user@flink.apache.org; Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>
>> *Subject:* Re: hdfs lease issues on flink retry
>>
>> Hi Siddharth,
>>
>> thanks for reaching out to the community. This might be a bug. Could you
>> share your Flink and YARN logs? This way we could get a better
>> understanding of what's going on.
>>
>> Best,
>> Matthias
>>
>> On Tue, Aug 24, 2021 at 10:19 PM Shah, Siddharth [Engineering] <siddharth.x.s...@gs.com> wrote:
>>
>> Hi Team,
>>
>> We are seeing transient failures in jobs that mostly require higher
>> resources and use Flink RestartStrategies [1].
>> Upon checking the YARN logs we have observed HDFS lease issues when the
>> Flink retry happens. The job originally fails on the first try with
>> PartitionNotFoundException or NoResourceAvailableException, but on retry
>> it seems from the YARN logs that the lease for the temp sink directory
>> has not yet been released by the node from the previous try.
>>
>> Initial failure log message:
>>
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>> Could not allocate enough slots to run the job. Please make sure that the
>> cluster has enough resources.
>>     at org.apache.flink.runtime.executiongraph.Execution.lambda$scheduleForExecution$0(Execution.java:461)
>>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>     at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:190)
>>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>
>> Retry failure log message:
>>
>> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.FileAlreadyExistsException):
>> /user/p2epda/lake/delp_prod/PROD/APPROVED/data/TECHRISK_SENTINEL/INFORMATION_REPORT/4377/temp/data/_temporary/0/_temporary/attempt__0000_r_000003_0/partMapper-r-00003.snappy.parquet
>> for client 10.51.63.226 already exists
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:2815)
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2702)
>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2586)
>>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:736)
>>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:409)
>>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>>
>> I could verify that it's the same nodes from the previous try owning the
>> lease, and checked this for multiple jobs by matching IP addresses.
>> Ideally, we want an internal retry to happen, since there will be
>> thousands of jobs running at a time and it is hard to retry them manually.
>>
>> This is our current restart config:
>>
>> executionEnv.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));
>>
>> Is it possible to resolve leases before a retry? Or is it possible to
>> have different sink directories (incrementing an attempt id somewhere) for
>> every retry, so that we have no lease issues? Or do you have any other
>> suggestion for resolving this?
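The second workaround proposed above (a distinct sink directory per retry) could be sketched as follows. This is a minimal illustration, not a Flink API: the class and method names below are hypothetical, and a real integration would need to obtain the attempt id from the execution environment and clean up directories of failed attempts.

```java
import java.nio.file.Paths;

public class AttemptScopedSink {

    // Hypothetical helper: derive a temp sink directory that embeds the
    // retry attempt id, so each retry writes to a fresh path and never
    // collides with an HDFS lease still held by a node from the
    // previous attempt.
    static String sinkDirForAttempt(String baseDir, int attemptId) {
        return Paths.get(baseDir, "temp-attempt-" + attemptId).toString();
    }

    public static void main(String[] args) {
        // Each retry gets its own directory under the job's sink root.
        System.out.println(sinkDirForAttempt("/tmp/job-sink", 0));
        System.out.println(sinkDirForAttempt("/tmp/job-sink", 1));
    }
}
```

With attempt-scoped paths, the FileAlreadyExistsException above cannot occur, at the cost of having to garbage-collect the abandoned temp directories of earlier attempts.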
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/execution/task_failure_recovery/#fixed-delay-restart-strategy
>>
>> Thanks,
>> Siddharth
>>
>> ------------------------------
>> Your Personal Data: We may collect and process information about you that
>> may be subject to data protection laws. For more information about how we
>> use and disclose your personal data, how we protect your information, our
>> legal basis to use your information, your rights and who you can contact,
>> please refer to: www.gs.com/privacy-notices

--
Matthias Pohl | Engineer
Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Yip Park Tung Jason, Jinwei (Kevin) Zhang, Karl Anton Wehner