[jira] [Closed] (FLINK-28325) DataOutputSerializer#writeBytes increase position twice
[ https://issues.apache.org/jira/browse/FLINK-28325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huweihua closed FLINK-28325. Resolution: Duplicate > DataOutputSerializer#writeBytes increase position twice > --- > > Key: FLINK-28325 > URL: https://issues.apache.org/jira/browse/FLINK-28325 > Project: Flink > Issue Type: Bug > Components: Runtime / Task > Reporter: huweihua > Priority: Minor > Attachments: image-2022-06-30-18-14-50-827.png, > image-2022-06-30-18-15-18-590.png, image.png > > > Hi, I was looking at the code and found that DataOutputSerializer.writeBytes > increases the position twice. This looks like a bug; please let me know if > it is intentional. > org.apache.flink.core.memory.DataOutputSerializer#writeBytes > > !image-2022-06-30-18-14-50-827.png!!image-2022-06-30-18-15-18-590.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28325) DataOutputSerializer#writeBytes increase position twice
huweihua created FLINK-28325: Summary: DataOutputSerializer#writeBytes increase position twice Key: FLINK-28325 URL: https://issues.apache.org/jira/browse/FLINK-28325 Project: Flink Issue Type: Bug Components: Runtime / Task Reporter: huweihua Attachments: image-2022-06-30-18-14-50-827.png, image-2022-06-30-18-15-18-590.png Hi, I was looking at the code and found that DataOutputSerializer.writeBytes increases the position twice. This looks like a bug; please let me know if it is intentional. org.apache.flink.core.memory.DataOutputSerializer#writeBytes !image-2022-06-30-18-14-50-827.png!!image-2022-06-30-18-15-18-590.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
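The suspected double increment can be illustrated with a stdlib-only toy (the class and field names below are mine, not the actual Flink DataOutputSerializer): writeByte() already advances the position, so the extra position += sLen at the end makes the position overshoot by the length of the string.

```java
// Simplified model of the suspected pattern (not the actual Flink source):
// writeByte() already advances `position`, so the trailing `position += sLen`
// counts every byte twice.
class ToyOutputSerializer {
    private final byte[] buffer = new byte[64];
    int position = 0;

    void writeByte(int b) {
        buffer[position++] = (byte) b; // first increment
    }

    void writeBytes(String s) {
        final int sLen = s.length();
        for (int i = 0; i < sLen; i++) {
            writeByte(s.charAt(i));
        }
        position += sLen; // second increment: position overshoots by sLen
    }
}

public class WriteBytesDemo {
    public static void main(String[] args) {
        ToyOutputSerializer out = new ToyOutputSerializer();
        out.writeBytes("abc");
        // 3 bytes were written, but position reports 6
        System.out.println(out.position);
    }
}
```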
[jira] [Updated] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.
[ https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huweihua updated FLINK-26932: - Affects Version/s: 1.11.0 > TaskManager hung in cleanupAllocationBaseDirs not exit. > --- > > Key: FLINK-26932 > URL: https://issues.apache.org/jira/browse/FLINK-26932 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.11.0 > Reporter: huweihua > Priority: Major > Attachments: 1280X1280.png, > origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png > > > The disk used by the TaskManager hit a fatal error, and the TaskManager then > hung in cleanupAllocationBaseDirs, blocking the main thread. > > As a result, the TaskManager could not respond to > cancelTask/disconnectResourceManager requests. > > Meanwhile, the JobMaster had already considered this TaskManager lost and > rescheduled its tasks to other TaskManagers. > > This may leave unexpected tasks running. > > The TaskManager log shows it had already lost its connection to the > ResourceManager and kept trying to re-register. The RegistrationTimeout > could not take effect because the TaskManager's main thread was hung. > > I think there are two options to handle it. > Option 1: Add a timeout for > TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but other > methods could block the main thread too. > Option 2: Move the registrationTimeout to another thread; we would need to > handle the resulting concurrency. > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
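Option 1 from the description could be sketched with a plain ExecutorService wrapper that bounds how long the caller (e.g. the TaskManager main thread) waits for a cleanup action; runWithTimeout and its signature are hypothetical helper names, not Flink API:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CleanupWithTimeout {
    // Run a potentially blocking cleanup action with an upper bound on how
    // long the caller will wait for it. Returns true iff it finished in time.
    static boolean runWithTimeout(Runnable cleanup, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> f = executor.submit(cleanup);
        try {
            f.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;               // cleanup finished in time
        } catch (TimeoutException e) {
            f.cancel(true);            // give up; the caller stays responsive
            return false;
        } catch (InterruptedException | ExecutionException e) {
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean ok = runWithTimeout(() -> { /* fast cleanup */ }, 1000);
        boolean hung = runWithTimeout(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        }, 100);
        System.out.println(ok + " " + hung);  // true false
    }
}
```

As the reporter notes, this only protects one call site; any other blocking call on the main thread would need the same treatment, which is why Option 2 (moving the registration timeout off the main thread) is also on the table.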
[jira] [Updated] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.
[ https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huweihua updated FLINK-26932: - Attachment: 1280X1280.png > TaskManager hung in cleanupAllocationBaseDirs not exit. > --- > > Key: FLINK-26932 > URL: https://issues.apache.org/jira/browse/FLINK-26932 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Reporter: huweihua > Priority: Major > Attachments: 1280X1280.png, > origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.
[ https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huweihua updated FLINK-26932: - Attachment: origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png > TaskManager hung in cleanupAllocationBaseDirs not exit. > --- > > Key: FLINK-26932 > URL: https://issues.apache.org/jira/browse/FLINK-26932 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Reporter: huweihua > Priority: Major > Attachments: origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.
[ https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huweihua updated FLINK-26932: - Description: The disk used by the TaskManager hit a fatal error, and the TaskManager then hung in cleanupAllocationBaseDirs, blocking the main thread. As a result, the TaskManager could not respond to cancelTask/disconnectResourceManager requests. Meanwhile, the JobMaster had already considered this TaskManager lost and rescheduled its tasks to other TaskManagers. This may leave unexpected tasks running. The TaskManager log shows it had already lost its connection to the ResourceManager and kept trying to re-register. The RegistrationTimeout could not take effect because the TaskManager's main thread was hung. I think there are two options to handle it. Option 1: Add a timeout for TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but other methods could block the main thread too. Option 2: Move the registrationTimeout to another thread; we would need to handle the resulting concurrency. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.
[ https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huweihua updated FLINK-26932: - Description: The disk used by the TaskManager hit a fatal error, and the TaskManager then hung in cleanupAllocationBaseDirs, blocking the main thread. As a result, the TaskManager could not respond to cancelTask/disconnectResourceManager requests. Meanwhile, the JobMaster had already considered this TaskManager lost and rescheduled its tasks to other TaskManagers. This may leave unexpected tasks running. The TaskManager log shows it had already lost its connection to the ResourceManager and kept trying to re-register. The RegistrationTimeout could not take effect because the TaskManager's main thread was hung. I think there are two options to handle it. Option 1: Add a timeout for TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but other methods could block the main thread too. Option 2: Move the registrationTimeout to another thread; we would need to handle the resulting concurrency. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26861) Log cannot be printed completely when use System.exit(1)
[ https://issues.apache.org/jira/browse/FLINK-26861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515021#comment-17515021 ] huweihua commented on FLINK-26861: -- Hi [~chenzihao], could you provide the log4j2 configuration? Is asynchronous mode enabled? > Log cannot be printed completely when use System.exit(1) > > > Key: FLINK-26861 > URL: https://issues.apache.org/jira/browse/FLINK-26861 > Project: Flink > Issue Type: Bug > Components: Client / Job Submission, Deployment / YARN > Affects Versions: 1.12.2, 1.14.0 > Reporter: chenzihao > Priority: Major > Attachments: image-2022-03-25-16-35-27-550.png, > image-2022-03-25-16-35-39-531.png, image-2022-03-25-16-35-46-528.png > > > When I submit a Flink job in YARN application mode and an expected exception > occurs, I cannot find it in the jobmanager log; I only see exit code 1 in > the YARN log. > !image-2022-03-25-16-35-27-550.png! > h2. Relevant code > {code:java} > // YarnApplicationClusterEntryPoint#main > try { > program = getPackagedProgram(configuration); > } catch (Exception e) { > LOG.error("Could not create application program.", e); > Thread.sleep(10); // add this for test > System.exit(1); > } > {code} > h2. Log info > * Without Thread.sleep, the ERROR log is missing. > !image-2022-03-25-16-35-39-531.png! > * With Thread.sleep, the ERROR log appears. > !image-2022-03-25-16-35-46-528.png! > h2. Reason > System.exit() terminates the running Java Virtual Machine before the ERROR > log has been written out. > h2. Improvement > I think we can print the ERROR log to jobmanager.err before calling > System.exit(1) so that we can get the exception info. -- This message was sent by Atlassian Jira (v8.20.1#820001)
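The underlying failure mode (buffered log output discarded when the JVM exits before a flush) can be reproduced with nothing but a BufferedWriter; this stands in for a buffered or asynchronous appender and is not log4j2 code. A real fix would flush or shut down the logging framework before calling System.exit(1).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;

public class LostLogDemo {
    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();          // stands in for the log file
        BufferedWriter log = new BufferedWriter(sink);   // stands in for a buffered appender

        log.write("ERROR Could not create application program.");
        // If the JVM called System.exit() here, the message would still be
        // sitting in the writer's 8 KiB buffer and never reach the sink:
        System.out.println("before flush: [" + sink + "]");

        log.flush();                                     // what a logging shutdown hook does
        System.out.println("after flush:  [" + sink + "]");
    }
}
```

The Thread.sleep(10) workaround in the issue merely gives the background appender time to drain; an explicit flush/shutdown of the logging framework before exiting is the deterministic version of the same idea.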
[jira] [Created] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.
huweihua created FLINK-26932: Summary: TaskManager hung in cleanupAllocationBaseDirs not exit. Key: FLINK-26932 URL: https://issues.apache.org/jira/browse/FLINK-26932 Project: Flink Issue Type: Bug Components: Runtime / Coordination Reporter: huweihua The disk used by the TaskManager hit a fatal error, and the TaskManager then hung in cleanupAllocationBaseDirs, blocking the main thread. As a result, the TaskManager could not respond to cancelTask/disconnectResourceManager requests. Meanwhile, the JobMaster had already considered this TaskManager lost and rescheduled its tasks to other TaskManagers. This may leave unexpected tasks running. The TaskManager log shows it had already lost its connection to the ResourceManager and kept trying to re-register. The RegistrationTimeout could not take effect because the TaskManager's main thread was hung. I think there are two options to handle it. Option 1: Add a timeout for TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but other methods could block the main thread too. Option 2: Move the registrationTimeout to another thread; we would need to handle the resulting concurrency. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-18229) Pending worker requests should be properly cleared
[ https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17514605#comment-17514605 ] huweihua commented on FLINK-18229: -- Sorry for the late reply, I was stuck with something else. For this change, I wrote a small design document, could you help me review it? https://docs.google.com/document/d/1lcmf3MKmcmf9tsPc1whaZHMYKurGoGtqOXEep9ngP2k/edit?usp=sharing [~xtsong] > Pending worker requests should be properly cleared > -- > > Key: FLINK-18229 > URL: https://issues.apache.org/jira/browse/FLINK-18229 > Project: Flink > Issue Type: Sub-task > Components: Deployment / Kubernetes, Deployment / YARN, Runtime / > Coordination > Affects Versions: 1.9.3, 1.10.1, 1.11.0 > Reporter: Xintong Song > Assignee: huweihua > Priority: Major > Fix For: 1.15.0 > > > Currently, if Kubernetes/Yarn does not have enough resources to fulfill > Flink's resource requirements, there will be pending pod/container requests on > Kubernetes/Yarn. These pending resource requests are never cleared until > either fulfilled or the Flink cluster is shut down. > However, sometimes Flink no longer needs the pending resources. E.g., the > slot request is fulfilled by another slot that becomes available, or the > job failed due to a slot request timeout (in a session cluster). In such cases, > Flink does not remove the resource request until the resource is allocated, > then discovers that it no longer needs the allocated resource and releases > it. This affects the underlying Kubernetes/Yarn cluster, especially > when the cluster is under heavy load. > It would be good for Flink to cancel pod/container requests as early as > possible once it discovers that some of the pending workers are no longer > needed. > There are several approaches that could potentially achieve this. > # We can always check whether there is a pending worker that can be canceled > when a {{PendingTaskManagerSlot}} is unassigned. > # We can have a separate timeout for requesting a new worker. If the resource > cannot be allocated within the given time since it was requested, we should cancel > that resource request and report a resource allocation failure. > # We can share the same timeout for starting a new worker (proposed in > FLINK-13554). This is similar to 2), but it requires the worker to be > registered, rather than allocated, before the timeout. -- This message was sent by Atlassian Jira (v8.20.1#820001)
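Approach 2 above (a separate timeout per worker request) might look roughly like the following stdlib-only sketch; PendingWorkerTracker, requestWorker, and workerAllocated are hypothetical names, not Flink's resource manager API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: each pending worker request gets its own timeout;
// if the request is still pending when the timer fires, it is cancelled.
public class PendingWorkerTracker {
    private final Map<String, ScheduledFuture<?>> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    void requestWorker(String requestId, long timeoutMillis, Runnable cancelAction) {
        ScheduledFuture<?> timeout = timer.schedule(() -> {
            if (pending.remove(requestId) != null) {
                cancelAction.run();     // e.g. release the pod/container request
            }
        }, timeoutMillis, TimeUnit.MILLISECONDS);
        pending.put(requestId, timeout);
    }

    // Called when the cluster manager fulfills the request in time.
    // Returns false if the request already timed out and was cancelled.
    boolean workerAllocated(String requestId) {
        ScheduledFuture<?> timeout = pending.remove(requestId);
        if (timeout != null) {
            timeout.cancel(false);
            return true;
        }
        return false;
    }

    void shutdown() { timer.shutdownNow(); }
}
```

The design choice in approach 3 differs only in what ends the timer: worker registration rather than allocation.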
[jira] [Commented] (FLINK-26395) The description of RAND_INTEGER is wrong in SQL function documents
[ https://issues.apache.org/jira/browse/FLINK-26395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507304#comment-17507304 ] huweihua commented on FLINK-26395: -- Hi [~libenchao], I have submitted a PR for this; please help review it when you have time. > The description of RAND_INTEGER is wrong in SQL function documents > -- > > Key: FLINK-26395 > URL: https://issues.apache.org/jira/browse/FLINK-26395 > Project: Flink > Issue Type: Bug > Components: Documentation > Affects Versions: 1.15.0, 1.13.5, 1.14.3 > Reporter: huweihua > Assignee: huweihua > Priority: Minor > Labels: pull-request-available > Attachments: image-2022-02-28-19-57-18-390.png > > > RAND_INTEGER returns an integer value, but the SQL function documentation > says it returns a double value. > !image-2022-02-28-19-57-18-390.png! -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26395) The description of RAND_INTEGER is wrong in SQL function documents
huweihua created FLINK-26395: Summary: The description of RAND_INTEGER is wrong in SQL function documents Key: FLINK-26395 URL: https://issues.apache.org/jira/browse/FLINK-26395 Project: Flink Issue Type: Bug Components: Documentation Affects Versions: 1.14.3, 1.13.5, 1.15.0 Reporter: huweihua Attachments: image-2022-02-28-19-57-18-390.png RAND_INTEGER returns an integer value, but the SQL function documentation says it returns a double value. !image-2022-02-28-19-57-18-390.png! -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-18229) Pending worker requests should be properly cleared
[ https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497929#comment-17497929 ] huweihua commented on FLINK-18229: -- Hi [~xtsong], I would like to deal with this issue, could you assign it to me? > Pending worker requests should be properly cleared > -- > > Key: FLINK-18229 > URL: https://issues.apache.org/jira/browse/FLINK-18229 > Project: Flink > Issue Type: Sub-task > Components: Deployment / Kubernetes, Deployment / YARN, Runtime / > Coordination > Affects Versions: 1.9.3, 1.10.1, 1.11.0 > Reporter: Xintong Song > Priority: Major > Fix For: 1.15.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-17341) freeSlot in TaskExecutor.closeJobManagerConnection cause ConcurrentModificationException
[ https://issues.apache.org/jira/browse/FLINK-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210588#comment-17210588 ] huweihua commented on FLINK-17341: -- I think we can put the slots that need to be freed in a list, and then free them outside the loop. [~trohrmann] > freeSlot in TaskExecutor.closeJobManagerConnection cause > ConcurrentModificationException > > > Key: FLINK-17341 > URL: https://issues.apache.org/jira/browse/FLINK-17341 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.9.0, 1.10.2, 1.11.2 > Reporter: huweihua > Assignee: Matthias > Priority: Major > Labels: pull-request-available > Fix For: 1.12.0, 1.10.3, 1.11.3 > > > TaskExecutor may call freeSlot during closeJobManagerConnection. freeSlot > modifies TaskSlotTable.slotsPerJob, and this modification causes a > ConcurrentModificationException. > {code:java} > Iterator<AllocationID> activeSlots = taskSlotTable.getActiveSlots(jobId); > final FlinkException freeingCause = new FlinkException("Slot could not be > marked inactive."); > while (activeSlots.hasNext()) { > AllocationID activeSlot = activeSlots.next(); > try { > if (!taskSlotTable.markSlotInactive(activeSlot, > taskManagerConfiguration.getTimeout())) { > freeSlotInternal(activeSlot, freeingCause); > } > } catch (SlotNotFoundException e) { > log.debug("Could not mark the slot {} inactive.", jobId, e); > } > } > {code} > error log: > {code:java} > 2020-04-21 23:37:11,363 ERROR org.apache.flink.runtime.rpc.akka.AkkaRpcActor > - Caught exception while executing runnable in main thread.
> java.util.ConcurrentModificationException > at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437) > at java.util.HashMap$KeyIterator.next(HashMap.java:1461) > at > org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$TaskSlotIterator.hasNext(TaskSlotTable.java:698) > at > org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$AllocationIDIterator.hasNext(TaskSlotTable.java:652) > at > org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1314) > at > org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1300(TaskExecutor.java:149) > at > org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1726) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
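The fix proposed in the comment, snapshotting the slots into a list and then freeing them outside the iteration, in a minimal stdlib form (plain strings stand in for AllocationID and the slot table):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SnapshotBeforeFree {
    public static void main(String[] args) {
        Map<String, String> slotsPerJob = new HashMap<>();
        slotsPerJob.put("slot-1", "job-A");
        slotsPerJob.put("slot-2", "job-A");
        slotsPerJob.put("slot-3", "job-A");

        // Buggy shape: removing entries while iterating the live view throws
        // ConcurrentModificationException, as in the stack trace above.
        //
        // for (String slot : slotsPerJob.keySet()) {
        //     slotsPerJob.remove(slot);   // CME
        // }

        // Fix: snapshot the keys into a list first, then mutate the map
        // outside the iteration over the live collection.
        List<String> toFree = new ArrayList<>(slotsPerJob.keySet());
        for (String slot : toFree) {
            slotsPerJob.remove(slot);      // safe: we iterate the snapshot
        }
        System.out.println(slotsPerJob.isEmpty());
    }
}
```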
[jira] [Commented] (FLINK-17341) freeSlot in TaskExecutor.closeJobManagerConnection cause ConcurrentModificationException
[ https://issues.apache.org/jira/browse/FLINK-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203758#comment-17203758 ] huweihua commented on FLINK-17341: -- [~trohrmann] I would like to fix this issue, could you assign this to me? > freeSlot in TaskExecutor.closeJobManagerConnection cause > ConcurrentModificationException > > > Key: FLINK-17341 > URL: https://issues.apache.org/jira/browse/FLINK-17341 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.9.0, 1.10.2, 1.11.2 > Reporter: huweihua > Priority: Major > Fix For: 1.12.0, 1.10.3, 1.11.3 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (FLINK-15071) YARN vcore capacity check can not pass when use large slotPerTaskManager
[ https://issues.apache.org/jira/browse/FLINK-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huweihua closed FLINK-15071. Release Note: not exist Resolution: Fixed > YARN vcore capacity check can not pass when use large slotPerTaskManager > > > Key: FLINK-15071 > URL: https://issues.apache.org/jira/browse/FLINK-15071 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN > Affects Versions: 1.9.0 > Reporter: huweihua > Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The YARN vcore capacity check in > YarnClusterDescriptor.isReadyForDeployment cannot pass if we configure a large > slotsPerTaskManager (such as 96). The dynamic property yarn.containers.vcores > does not take effect. > This is because we set the dynamicProperties after the isReadyForDeployment check. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-17341) freeSlot in TaskExecutor.closeJobManagerConnection cause ConcurrentModificationException
huweihua created FLINK-17341: Summary: freeSlot in TaskExecutor.closeJobManagerConnection cause ConcurrentModificationException Key: FLINK-17341 URL: https://issues.apache.org/jira/browse/FLINK-17341 Project: Flink Issue Type: Bug Reporter: huweihua TaskExecutor may call freeSlot during closeJobManagerConnection. freeSlot modifies TaskSlotTable.slotsPerJob, and this modification causes a ConcurrentModificationException. {code:java} Iterator<AllocationID> activeSlots = taskSlotTable.getActiveSlots(jobId); final FlinkException freeingCause = new FlinkException("Slot could not be marked inactive."); while (activeSlots.hasNext()) { AllocationID activeSlot = activeSlots.next(); try { if (!taskSlotTable.markSlotInactive(activeSlot, taskManagerConfiguration.getTimeout())) { freeSlotInternal(activeSlot, freeingCause); } } catch (SlotNotFoundException e) { log.debug("Could not mark the slot {} inactive.", jobId, e); } } {code} error log: {code:java} 2020-04-21 23:37:11,363 ERROR org.apache.flink.runtime.rpc.akka.AkkaRpcActor - Caught exception while executing runnable in main thread.
java.util.ConcurrentModificationException
	at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
	at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
	at org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$TaskSlotIterator.hasNext(TaskSlotTable.java:698)
	at org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$AllocationIDIterator.hasNext(TaskSlotTable.java:652)
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1314)
	at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1300(TaskExecutor.java:149)
	at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1726)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
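The failure mode and the usual fix can be reproduced outside Flink. The sketch below uses plain Java with illustrative names (not Flink's actual classes): freeing a slot mutates the live collection mid-iteration and triggers the exception, while iterating over a snapshot of the allocation IDs does not.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Standalone sketch (not Flink code) of the bug: freeing a slot structurally
// modifies the collection backing the active-slot iterator, so the next call
// on that iterator throws ConcurrentModificationException.
public class SlotIterationDemo {

    /** Frees one slot: removes it from the job's slot set (and the job entry when empty). */
    static void freeSlot(Map<String, Set<String>> slotsPerJob, String jobId, String allocationId) {
        Set<String> slots = slotsPerJob.get(jobId);
        slots.remove(allocationId);
        if (slots.isEmpty()) {
            slotsPerJob.remove(jobId);
        }
    }

    /** Buggy pattern: free slots while iterating the live set. Returns true if a CME was thrown. */
    static boolean freeWhileIterating() {
        Map<String, Set<String>> slotsPerJob = new HashMap<>();
        slotsPerJob.put("job-1", new HashSet<>(List.of("alloc-1", "alloc-2")));
        try {
            for (String alloc : slotsPerJob.get("job-1")) {
                freeSlot(slotsPerJob, "job-1", alloc); // structural modification mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Fixed pattern: snapshot the allocation ids first, then free. Returns true if all were freed. */
    static boolean freeViaSnapshot() {
        Map<String, Set<String>> slotsPerJob = new HashMap<>();
        slotsPerJob.put("job-1", new HashSet<>(List.of("alloc-1", "alloc-2")));
        List<String> snapshot = new ArrayList<>(slotsPerJob.get("job-1"));
        for (String alloc : snapshot) {
            freeSlot(slotsPerJob, "job-1", alloc);
        }
        return !slotsPerJob.containsKey("job-1");
    }

    public static void main(String[] args) {
        System.out.println(freeWhileIterating() + " " + freeViaSnapshot()); // true true
    }
}
```

Iterating a defensive copy is the standard remedy when the loop body may remove entries from the underlying collection.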
[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections
[ https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082938#comment-17082938 ] huweihua commented on FLINK-10052: -- hi [~Tison] Is there any progress in the curator upgrade? > Tolerate temporarily suspended ZooKeeper connections > > > Key: FLINK-10052 > URL: https://issues.apache.org/jira/browse/FLINK-10052 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1 >Reporter: Till Rohrmann >Assignee: Zili Chen >Priority: Major > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > This issue results from FLINK-10011 which uncovered a problem with Flink's HA > recovery and proposed the following solution to harden Flink: > The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator > recipe for leader election. The leader latch revokes leadership in case of a > suspended ZooKeeper connection. This can be premature in case that the system > can reconnect to ZooKeeper before its session expires. The effect of the lost > leadership is that all jobs will be canceled and directly restarted after > regaining the leadership. > Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper > connection, it would be better to wait until the ZooKeeper connection is > LOST. That way we would allow the system to reconnect and not lose the > leadership. This could be achievable by using Curator's {{LeaderSelector}} > instead of the {{LeaderLatch}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
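The behavior change proposed in the description can be modeled as a tiny state machine. This is a sketch of the *policy* only, with hypothetical class and enum names; Curator's real LeaderSelector and connection-state handling are considerably richer:

```java
// Illustrative model only (not Curator's API): leadership is revoked on
// LOST (session expired), not on SUSPENDED. A client that reconnects before
// its session expires keeps its leadership, avoiding the spurious job
// restarts described above.
public class LeadershipPolicy {
    public enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private boolean leader;

    public LeadershipPolicy(boolean initiallyLeader) {
        this.leader = initiallyLeader;
    }

    public void onStateChanged(ConnState state) {
        switch (state) {
            case SUSPENDED:
                // LeaderLatch-style behavior would revoke leadership here.
                // Instead we wait: the session may still be alive on the server.
                break;
            case RECONNECTED:
                // Reconnected within the session timeout: leadership was never lost.
                break;
            case LOST:
                // Session expired: another contender may now hold the leadership.
                leader = false;
                break;
            default:
                break;
        }
    }

    public boolean isLeader() {
        return leader;
    }

    public static void main(String[] args) {
        LeadershipPolicy p = new LeadershipPolicy(true);
        p.onStateChanged(ConnState.SUSPENDED);
        p.onStateChanged(ConnState.RECONNECTED);
        boolean keptAfterReconnect = p.isLeader();

        p.onStateChanged(ConnState.SUSPENDED);
        p.onStateChanged(ConnState.LOST);
        System.out.println(keptAfterReconnect + " " + p.isLeader()); // true false
    }
}
```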
[jira] [Comment Edited] (FLINK-15071) YARN vcore capacity check can not pass when use large slotPerTaskManager
[ https://issues.apache.org/jira/browse/FLINK-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058335#comment-17058335 ] huweihua edited comment on FLINK-15071 at 3/13/20, 1:37 AM: h4. [Chesnay Schepler|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=chesnay], thanks for your reply. It always ignores the vcores configured by "yarn.containers.vcores". was (Author: huwh): Chesnay Schepler, thanks for your reply, It is always ignored the vcores configured by "yarn.containers.vcores". > YARN vcore capacity check can not pass when use large slotPerTaskManager > > > Key: FLINK-15071 > URL: https://issues.apache.org/jira/browse/FLINK-15071 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.9.0 >Reporter: huweihua >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The YARN vcore capacity check in > YarnClusterDescriptor.isReadyForDeployment cannot pass if we configure a large > slotsPerTaskManager (such as 96). The dynamic property yarn.containers.vcores > does not take effect. > This is because we set the dynamicProperties after the isReadyForDeployment check. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-15071) YARN vcore capacity check can not pass when use large slotPerTaskManager
[ https://issues.apache.org/jira/browse/FLINK-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058335#comment-17058335 ] huweihua commented on FLINK-15071: -- Chesnay Schepler, thanks for your reply. It always ignores the vcores configured by "yarn.containers.vcores". > YARN vcore capacity check can not pass when use large slotPerTaskManager > > > Key: FLINK-15071 > URL: https://issues.apache.org/jira/browse/FLINK-15071 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.9.0 >Reporter: huweihua >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The YARN vcore capacity check in > YarnClusterDescriptor.isReadyForDeployment cannot pass if we configure a large > slotsPerTaskManager (such as 96). The dynamic property yarn.containers.vcores > does not take effect. > This is because we set the dynamicProperties after the isReadyForDeployment check. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-16069) Creation of TaskDeploymentDescriptor can block main thread for long time
[ https://issues.apache.org/jira/browse/FLINK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039191#comment-17039191 ] huweihua commented on FLINK-16069: -- Hi, [~trohrmann], thanks for your time. Yes, the iteration over the input edges takes so long. I didn't think too much about race conditions. I tried creating the TaskDeploymentDescriptorFactory in the main thread and then putting createDeploymentDescriptor into a future. This reduces the time taken in the main thread to 2s. Glad to receive any suggestions. > Creation of TaskDeploymentDescriptor can block main thread for long time > > > Key: FLINK-16069 > URL: https://issues.apache.org/jira/browse/FLINK-16069 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Reporter: huweihua >Priority: Major > > Deploying tasks takes a long time when we submit a high-parallelism > job. Execution#deploy runs in the main thread, so it blocks the JobMaster from > processing other Akka messages, such as heartbeats. The creation of the > TaskDeploymentDescriptor takes most of the time. We can put the creation in a future. > For example, for a job [source(8000)->sink(8000)], the 16000 tasks in total took more than > 1 minute to go from SCHEDULED to DEPLOYING. This caused TaskManager heartbeats to > time out and the job never succeeded. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (FLINK-16069) Creation of TaskDeploymentDescriptor can block main thread for long time
[ https://issues.apache.org/jira/browse/FLINK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039191#comment-17039191 ] huweihua edited comment on FLINK-16069 at 2/18/20 4:07 PM: --- Hi, [~trohrmann], thanks for your time. Yes, the iteration over the input edges takes so long. I didn't think too much about race conditions. I tried creating the TaskDeploymentDescriptorFactory in the main thread and then putting createDeploymentDescriptor into a future. This reduces the time taken in the main thread to 2s. Glad to receive any suggestions. was (Author: huwh): Hi, [~trohrmann], thanks for your time. Yes, The iteration over the input edges taking so long. I didn't think too much about race condition, and i tried create the TaskDeploymentDescriptorFactory in the main thread, then put createDeploymentDescriptor into a future. This reduces the time taken in the main thread to 2s. Glad to receive any suggestions. > Creation of TaskDeploymentDescriptor can block main thread for long time > > > Key: FLINK-16069 > URL: https://issues.apache.org/jira/browse/FLINK-16069 > Project: Flink > Issue Type: Improvement > Components: Runtime / Coordination >Reporter: huweihua >Priority: Major > > Deploying tasks takes a long time when we submit a high-parallelism > job. Execution#deploy runs in the main thread, so it blocks the JobMaster from > processing other Akka messages, such as heartbeats. The creation of the > TaskDeploymentDescriptor takes most of the time. We can put the creation in a future. > For example, for a job [source(8000)->sink(8000)], the 16000 tasks in total took more than > 1 minute to go from SCHEDULED to DEPLOYING. This caused TaskManager heartbeats to > time out and the job never succeeded. -- This message was sent by Atlassian Jira (v8.3.4#803005)
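The workaround described in the comment above (build the factory on the main thread, create the descriptor in a future) can be sketched with plain `CompletableFuture` plumbing. All names below are assumptions for illustration, not Flink's actual API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch under assumed names: capture the descriptor inputs on the main
// thread, do the heavy per-edge work on an I/O pool, then hop back to the
// single main-thread executor for the actual submit.
public class AsyncDeployDemo {

    /** Stand-in for the expensive TaskDeploymentDescriptor construction. */
    static String buildDescriptor(int subtaskIndex, int numInputEdges) {
        StringBuilder sb = new StringBuilder();
        for (int edge = 0; edge < numInputEdges; edge++) {
            sb.append(subtaskIndex).append(':').append(edge).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ExecutorService ioExecutor = Executors.newFixedThreadPool(4);
        ExecutorService mainThreadExecutor = Executors.newSingleThreadExecutor();

        // Inputs are read on the "main thread" before going async, so the
        // heavy stage never touches mutable scheduler state (this is how the
        // race-condition concern above is sidestepped).
        int subtaskIndex = 7;
        int numInputEdges = 3;

        CompletableFuture<String> deployed = CompletableFuture
            .supplyAsync(() -> buildDescriptor(subtaskIndex, numInputEdges), ioExecutor)
            .thenApplyAsync(descriptor -> "deployed:" + descriptor, mainThreadExecutor);

        System.out.println(deployed.join()); // deployed:7:0;7:1;7:2;
        ioExecutor.shutdown();
        mainThreadExecutor.shutdown();
    }
}
```

Keeping the final `thenApplyAsync` stage on the main-thread executor preserves the single-threaded model the JobMaster relies on, while the expensive construction no longer starves heartbeat processing.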
[jira] [Commented] (FLINK-12122) Spread out tasks evenly across all available registered TaskManagers
[ https://issues.apache.org/jira/browse/FLINK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038886#comment-17038886 ] huweihua commented on FLINK-12122: -- [~trohrmann] I think the minimum number of TaskExecutors is helpful. Do we have an issue for this feature? > Spread out tasks evenly across all available registered TaskManagers > > > Key: FLINK-12122 > URL: https://issues.apache.org/jira/browse/FLINK-12122 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Affects Versions: 1.6.4, 1.7.2, 1.8.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Major > Labels: pull-request-available > Fix For: 1.9.2, 1.10.0 > > Attachments: image-2019-05-21-12-28-29-538.png, > image-2019-05-21-13-02-50-251.png > > Time Spent: 20m > Remaining Estimate: 0h > > With Flip-6, we changed the default behaviour how slots are assigned to > {{TaskManagers}}. Instead of evenly spreading it out over all registered > {{TaskManagers}}, we randomly pick slots from {{TaskManagers}} with a > tendency to first fill up a TM before using another one. This is a regression > wrt the pre Flip-6 code. > I suggest to change the behaviour so that we try to evenly distribute slots > across all available {{TaskManagers}} by considering how many of their slots > are already allocated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (FLINK-16069) Create TaskDeploymentDescriptor in future.
[ https://issues.apache.org/jira/browse/FLINK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huweihua updated FLINK-16069: - Description: Deploying tasks takes a long time when we submit a high-parallelism job. Execution#deploy runs in the main thread, so it blocks the JobMaster from processing other Akka messages, such as heartbeats. The creation of the TaskDeploymentDescriptor takes most of the time. We can put the creation in a future. For example, for a job [source(8000)->sink(8000)], the 16000 tasks in total took more than 1 minute to go from SCHEDULED to DEPLOYING. This caused TaskManager heartbeats to time out and the job never succeeded. was: The deploy of tasks will took long time when we submit a high parallelism job. And Execution#deploy run in mainThread, so it will block JobMaster process other akka messages, such as Heartbeat. The creation of TaskDeploymentDescriptor take most of time. We can put the creation in future. For example, A job [source(8000)->sink(8000)], the total 16000 tasks from SCHEDULED to DEPLOYING took more than 1mins. This caused the heartbeat of TaskManager timeout and job never success. > Create TaskDeploymentDescriptor in future. > -- > > Key: FLINK-16069 > URL: https://issues.apache.org/jira/browse/FLINK-16069 > Project: Flink > Issue Type: Improvement > Components: Runtime / Task >Reporter: huweihua >Priority: Major > > Deploying tasks takes a long time when we submit a high-parallelism > job. Execution#deploy runs in the main thread, so it blocks the JobMaster from > processing other Akka messages, such as heartbeats. The creation of the > TaskDeploymentDescriptor takes most of the time. We can put the creation in a future. > For example, for a job [source(8000)->sink(8000)], the 16000 tasks in total took more than > 1 minute to go from SCHEDULED to DEPLOYING. This caused TaskManager heartbeats to > time out and the job never succeeded. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-16069) Create TaskDeploymentDescriptor in future.
huweihua created FLINK-16069: Summary: Create TaskDeploymentDescriptor in future. Key: FLINK-16069 URL: https://issues.apache.org/jira/browse/FLINK-16069 Project: Flink Issue Type: Improvement Components: Runtime / Task Reporter: huweihua Deploying tasks takes a long time when we submit a high-parallelism job. Execution#deploy runs in the main thread, so it blocks the JobMaster from processing other Akka messages, such as heartbeats. The creation of the TaskDeploymentDescriptor takes most of the time. We can put the creation in a future. For example, for a job [source(8000)->sink(8000)], the 16000 tasks in total took more than 1 minute to go from SCHEDULED to DEPLOYING. This caused TaskManager heartbeats to time out and the job never succeeded. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (FLINK-12122) Spread out tasks evenly across all available registered TaskManagers
[ https://issues.apache.org/jira/browse/FLINK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015832#comment-17015832 ] huweihua commented on FLINK-12122: -- [~trohrmann] I have the same issue as [~liuyufei]. We run Flink in per-job mode and have thousands of jobs that need to be upgraded from Flink 1.5 to Flink 1.9. The change of scheduling strategy causes load-balancing issues, which blocked our upgrade plan. In addition to the load-balancing issue, we also encountered other problems caused by the Flink 1.9 scheduling strategy. # Network bandwidth: tasks of the same type are scheduled to one TaskManager, causing too much network traffic on that machine. # Some jobs need to sink to a local agent; after centralized scheduling, the insufficient processing capacity of a single machine causes a consumption backlog. I think a decentralized (spread-out) scheduling strategy is reasonable. > Spread out tasks evenly across all available registered TaskManagers > > > Key: FLINK-12122 > URL: https://issues.apache.org/jira/browse/FLINK-12122 > Project: Flink > Issue Type: Sub-task > Components: Runtime / Coordination >Affects Versions: 1.6.4, 1.7.2, 1.8.0 >Reporter: Till Rohrmann >Assignee: Till Rohrmann >Priority: Major > Labels: pull-request-available > Fix For: 1.9.2, 1.10.0 > > Attachments: image-2019-05-21-12-28-29-538.png, > image-2019-05-21-13-02-50-251.png > > Time Spent: 20m > Remaining Estimate: 0h > > With Flip-6, we changed the default behaviour how slots are assigned to > {{TaskManagers}}. Instead of evenly spreading it out over all registered > {{TaskManagers}}, we randomly pick slots from {{TaskManagers}} with a > tendency to first fill up a TM before using another one. This is a regression > wrt the pre Flip-6 code. > I suggest to change the behaviour so that we try to evenly distribute slots > across all available {{TaskManagers}} by considering how many of their slots > are already allocated. 
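The proposed "spread out" behavior amounts to picking, among TaskManagers that still have free slots, the one with the lowest utilization. A minimal sketch under assumed names (Flink's real logic lives in the SlotManager/SlotPool and is more involved):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative selection rule (not Flink's implementation): choose the
// TaskManager with the lowest allocated/total ratio instead of filling one
// TM completely before touching the next.
public class SpreadOutSelector {

    /**
     * @param tms map from TaskManager id to {allocatedSlots, totalSlots}
     * @return id of the least-utilized TM with free slots, or null if all are full
     */
    static String pickLeastUtilized(Map<String, int[]> tms) {
        String best = null;
        double bestUtil = Double.MAX_VALUE;
        for (Map.Entry<String, int[]> e : tms.entrySet()) {
            int allocated = e.getValue()[0];
            int total = e.getValue()[1];
            if (allocated >= total) {
                continue; // no free slots on this TM
            }
            double util = (double) allocated / total;
            if (util < bestUtil) {
                bestUtil = util;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, int[]> tms = new LinkedHashMap<>();
        tms.put("tm-1", new int[] {3, 4}); // 75% used
        tms.put("tm-2", new int[] {1, 4}); // 25% used
        tms.put("tm-3", new int[] {4, 4}); // full
        System.out.println(pickLeastUtilized(tms)); // tm-2
    }
}
```

Under the pre-Flip-6 behavior sketched here, consecutive requests naturally alternate across machines, which addresses both the network-bandwidth and the local-agent concerns raised in the comment.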
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (FLINK-15071) YARN vcore capacity check can not pass when use large slotPerTaskManager
huweihua created FLINK-15071: Summary: YARN vcore capacity check can not pass when use large slotPerTaskManager Key: FLINK-15071 URL: https://issues.apache.org/jira/browse/FLINK-15071 Project: Flink Issue Type: Bug Components: Deployment / YARN Affects Versions: 1.9.0 Reporter: huweihua The YARN vcore capacity check in YarnClusterDescriptor.isReadyForDeployment cannot pass if we configure a large slotsPerTaskManager (such as 96). The dynamic property yarn.containers.vcores does not take effect. This is because we set the dynamicProperties after the isReadyForDeployment check. -- This message was sent by Atlassian Jira (v8.3.4#803005)
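The ordering bug can be illustrated with a toy configuration check: if the deployment check runs before the `-Dyarn.containers.vcores=...` dynamic property is folded into the configuration, the check falls back to `slotsPerTaskManager` as the vcore count and fails. The configuration key is Flink's real option name; the method shapes are simplified stand-ins:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the vcore capacity check: reads the effective
// yarn.containers.vcores (defaulting to slotsPerTaskManager, as Flink does)
// and compares it against the cluster's per-container vcore maximum.
public class VcoreCheckDemo {

    static boolean isReadyForDeployment(Map<String, String> conf, int slotsPerTaskManager, int yarnMaxVcores) {
        int vcores = Integer.parseInt(
            conf.getOrDefault("yarn.containers.vcores", String.valueOf(slotsPerTaskManager)));
        return vcores <= yarnMaxVcores;
    }

    public static void main(String[] args) {
        int slotsPerTaskManager = 96;
        int yarnMaxVcores = 32;

        // Buggy order: check first, apply the dynamic property afterwards.
        Map<String, String> conf = new HashMap<>();
        boolean buggy = isReadyForDeployment(conf, slotsPerTaskManager, yarnMaxVcores);
        conf.put("yarn.containers.vcores", "8"); // applied too late to matter

        // Fixed order: fold in the dynamic property, then run the check.
        Map<String, String> conf2 = new HashMap<>();
        conf2.put("yarn.containers.vcores", "8");
        boolean fixed = isReadyForDeployment(conf2, slotsPerTaskManager, yarnMaxVcores);

        System.out.println(buggy + " " + fixed); // false true
    }
}
```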