[jira] [Closed] (FLINK-28325) DataOutputSerializer#writeBytes increase position twice

2022-07-11 Thread huweihua (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-28325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huweihua closed FLINK-28325.

Resolution: Duplicate

> DataOutputSerializer#writeBytes increase position twice
> ---
>
> Key: FLINK-28325
> URL: https://issues.apache.org/jira/browse/FLINK-28325
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Task
>Reporter: huweihua
>Priority: Minor
> Attachments: image-2022-06-30-18-14-50-827.png, 
> image-2022-06-30-18-15-18-590.png, image.png
>
>
> Hi, I was looking at the code and found that DataOutputSerializer.writeBytes 
> increases the position twice. This looks like a bug to me; please let me know 
> if it is intentional.
> org.apache.flink.core.memory.DataOutputSerializer#writeBytes
>  
> !image-2022-06-30-18-14-50-827.png!!image-2022-06-30-18-15-18-590.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28325) DataOutputSerializer#writeBytes increase position twice

2022-06-30 Thread huweihua (Jira)
huweihua created FLINK-28325:


 Summary: DataOutputSerializer#writeBytes increase position twice
 Key: FLINK-28325
 URL: https://issues.apache.org/jira/browse/FLINK-28325
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Reporter: huweihua
 Attachments: image-2022-06-30-18-14-50-827.png, 
image-2022-06-30-18-15-18-590.png

Hi, I was looking at the code and found that DataOutputSerializer.writeBytes 
increases the position twice. This looks like a bug to me; please let me know if 
it is intentional.

org.apache.flink.core.memory.DataOutputSerializer#writeBytes

 

!image-2022-06-30-18-14-50-827.png!!image-2022-06-30-18-15-18-590.png!
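
The screenshots are not reproduced here. As a rough, self-contained sketch of the 
reported pattern (simplified placeholder code, not the actual 
org.apache.flink.core.memory.DataOutputSerializer source), a writeBytes that 
advances the position once per byte inside writeByte and then adds the length 
again at the end would end up twice as far as expected:

{code:java}
// Illustrative sketch only: simplified placeholder code, not the Flink source.
class PositionBuffer {
    private final byte[] buffer = new byte[1024];
    private int position;

    void writeByte(int v) {
        buffer[position++] = (byte) v;   // first advance: one per byte
    }

    void writeBytes(String s) {
        final int sLen = s.length();
        for (int i = 0; i < sLen; i++) {
            writeByte(s.charAt(i));
        }
        position += sLen;                // second advance: position moved twice in total
    }

    public static void main(String[] args) {
        PositionBuffer buf = new PositionBuffer();
        buf.writeBytes("abc");
        // Prints 6 instead of the expected 3 when the extra "position += sLen" is present.
        System.out.println(buf.position);
    }
}
{code}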



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.

2022-04-15 Thread huweihua (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huweihua updated FLINK-26932:
-
Affects Version/s: 1.11.0

> TaskManager hung in cleanupAllocationBaseDirs not exit.
> ---
>
> Key: FLINK-26932
> URL: https://issues.apache.org/jira/browse/FLINK-26932
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.11.0
>Reporter: huweihua
>Priority: Major
> Attachments: 1280X1280.png, 
> origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png
>
>
> The disk used by the TaskManager hit a fatal error, and the TaskManager then 
> hung in cleanupAllocationBaseDirs, blocking the main thread.
>  
> As a result, this TaskManager would not respond to 
> cancelTask/disconnectResourceManager requests.
>  
> At the same time, the JobMaster already considered this TaskManager lost and 
> scheduled its tasks to other TaskManagers.
>  
> This may leave some unexpected tasks running.
>  
> According to the TaskManager log, the TM had already lost its connection to 
> the ResourceManager and kept trying to re-register. The RegistrationTimeout 
> could not take effect because the TaskManager's main thread was hung.
>  
> I think there are two options to handle it.
> Option 1: Add a timeout to 
> TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but I am afraid 
> some other methods could block the main thread too.
> Option 2: Move the registrationTimeout to another thread; we would need to 
> deal with the concurrency problem.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.

2022-03-30 Thread huweihua (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huweihua updated FLINK-26932:
-
Attachment: 1280X1280.png

> TaskManager hung in cleanupAllocationBaseDirs not exit.
> ---
>
> Key: FLINK-26932
> URL: https://issues.apache.org/jira/browse/FLINK-26932
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: huweihua
>Priority: Major
> Attachments: 1280X1280.png, 
> origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png
>
>
> The disk used by the TaskManager hit a fatal error, and the TaskManager then 
> hung in cleanupAllocationBaseDirs, blocking the main thread.
>  
> As a result, this TaskManager would not respond to 
> cancelTask/disconnectResourceManager requests.
>  
> At the same time, the JobMaster already considered this TaskManager lost and 
> scheduled its tasks to other TaskManagers.
>  
> This may leave some unexpected tasks running.
>  
> According to the TaskManager log, the TM had already lost its connection to 
> the ResourceManager and kept trying to re-register. The RegistrationTimeout 
> could not take effect because the TaskManager's main thread was hung.
>  
> I think there are two options to handle it.
> Option 1: Add a timeout to 
> TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but I am afraid 
> some other methods could block the main thread too.
> Option 2: Move the registrationTimeout to another thread; we would need to 
> deal with the concurrency problem.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.

2022-03-30 Thread huweihua (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huweihua updated FLINK-26932:
-
Attachment: origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png

> TaskManager hung in cleanupAllocationBaseDirs not exit.
> ---
>
> Key: FLINK-26932
> URL: https://issues.apache.org/jira/browse/FLINK-26932
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: huweihua
>Priority: Major
> Attachments: origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png
>
>
> The disk used by the TaskManager hit a fatal error, and the TaskManager then 
> hung in cleanupAllocationBaseDirs, blocking the main thread.
>  
> As a result, this TaskManager would not respond to 
> cancelTask/disconnectResourceManager requests.
>  
> At the same time, the JobMaster already considered this TaskManager lost and 
> scheduled its tasks to other TaskManagers.
>  
> This may leave some unexpected tasks running.
>  
> According to the TaskManager log, the TM had already lost its connection to 
> the ResourceManager and kept trying to re-register. The RegistrationTimeout 
> could not take effect because the TaskManager's main thread was hung.
>  
> I think there are two options to handle it.
> Option 1: Add a timeout to 
> TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but I am afraid 
> some other methods could block the main thread too.
> Option 2: Move the registrationTimeout to another thread; we would need to 
> deal with the concurrency problem.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.

2022-03-30 Thread huweihua (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huweihua updated FLINK-26932:
-
Description: 
The disk used by the TaskManager hit a fatal error, and the TaskManager then 
hung in cleanupAllocationBaseDirs, blocking the main thread.
 
As a result, this TaskManager would not respond to 
cancelTask/disconnectResourceManager requests.
 
At the same time, the JobMaster already considered this TaskManager lost and 
scheduled its tasks to other TaskManagers.
 
This may leave some unexpected tasks running.
 
According to the TaskManager log, the TM had already lost its connection to the 
ResourceManager and kept trying to re-register. The RegistrationTimeout could 
not take effect because the TaskManager's main thread was hung.
 
I think there are two options to handle it.

Option 1: Add a timeout to 
TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but I am afraid 
some other methods could block the main thread too.
Option 2: Move the registrationTimeout to another thread; we would need to deal 
with the concurrency problem.

 
 

  was:
The disk used by the TaskManager hit a fatal error, and the TaskManager then 
hung in cleanupAllocationBaseDirs, blocking the main thread.
 
As a result, this TaskManager would not respond to 
cancelTask/disconnectResourceManager requests.
 
At the same time, the JobMaster already considered this TaskManager lost and 
scheduled its tasks to other TaskManagers.
 
This may leave some unexpected tasks running.
 
According to the TaskManager log, the TM had already lost its connection to the 
ResourceManager and kept trying to re-register. The RegistrationTimeout could 
not take effect because the TaskManager's main thread was hung.
 
I think there are two options to handle it.

Option 1: Add a timeout to 
TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but I am afraid 
some other methods could block the main thread too.
Option 2: Move the registrationTimeout to another thread; we would need to deal 
with the concurrency problem.

 

 
 


> TaskManager hung in cleanupAllocationBaseDirs not exit.
> ---
>
> Key: FLINK-26932
> URL: https://issues.apache.org/jira/browse/FLINK-26932
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: huweihua
>Priority: Major
> Attachments: origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png
>
>
> The disk used by the TaskManager hit a fatal error, and the TaskManager then 
> hung in cleanupAllocationBaseDirs, blocking the main thread.
>  
> As a result, this TaskManager would not respond to 
> cancelTask/disconnectResourceManager requests.
>  
> At the same time, the JobMaster already considered this TaskManager lost and 
> scheduled its tasks to other TaskManagers.
>  
> This may leave some unexpected tasks running.
>  
> According to the TaskManager log, the TM had already lost its connection to 
> the ResourceManager and kept trying to re-register. The RegistrationTimeout 
> could not take effect because the TaskManager's main thread was hung.
>  
> I think there are two options to handle it.
> Option 1: Add a timeout to 
> TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but I am afraid 
> some other methods could block the main thread too.
> Option 2: Move the registrationTimeout to another thread; we would need to 
> deal with the concurrency problem.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.

2022-03-30 Thread huweihua (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-26932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huweihua updated FLINK-26932:
-
Description: 
The disk used by the TaskManager hit a fatal error, and the TaskManager then 
hung in cleanupAllocationBaseDirs, blocking the main thread.
 
As a result, this TaskManager would not respond to 
cancelTask/disconnectResourceManager requests.
 
At the same time, the JobMaster already considered this TaskManager lost and 
scheduled its tasks to other TaskManagers.
 
This may leave some unexpected tasks running.
 
According to the TaskManager log, the TM had already lost its connection to the 
ResourceManager and kept trying to re-register. The RegistrationTimeout could 
not take effect because the TaskManager's main thread was hung.
 
I think there are two options to handle it.

Option 1: Add a timeout to 
TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but I am afraid 
some other methods could block the main thread too.
Option 2: Move the registrationTimeout to another thread; we would need to deal 
with the concurrency problem.

 

 
 

  was:
The disk used by the TaskManager hit a fatal error, and the TaskManager then 
hung in cleanupAllocationBaseDirs, blocking the main thread.
 
As a result, this TaskManager would not respond to 
cancelTask/disconnectResourceManager requests.
 
At the same time, the JobMaster already considered this TaskManager lost and 
scheduled its tasks to other TaskManagers.
 
This may leave some unexpected tasks running.
 
According to the TaskManager log, the TM had already lost its connection to the 
ResourceManager and kept trying to re-register. The RegistrationTimeout could 
not take effect because the TaskManager's main thread was hung.
 
I think there are two options to handle it. # 
Option 1: Add a timeout to 
TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but I am afraid 
some other methods could block the main thread too.

 # 
Option 2: Move the registrationTimeout to another thread; we would need to deal 
with the concurrency problem.
!https://bytedance.feishu.cn/space/api/box/stream/download/asynccode/?code=ZmVkMDNhZjZkNzA2NTNkOGZjNjJmNGM0ZGYxNGY2NDFfTnV4SUd0RzQ3WnVJRWpWdVBJNFFncEMzTHdZZ3U0WDFfVG9rZW46Ym94Y25zMG1GdWM5M2hKNzJEcXhyN0FmRFgxXzE2NDg2NTE4Njg6MTY0ODY1NTQ2OF9WNA!
 
!https://bytedance.feishu.cn/space/api/box/stream/download/asynccode/?code=MDhiZDU0NDg0NzU3ZjgwYmIxOTU0YzQyMTIxMGE4YzJfQkhLMVI2bEZGUnhpR210c1BDelZDRUI3YjJDY2Q1T3NfVG9rZW46Ym94Y250aG1UTjBXTmI2TTFqYlV1eG9MTnMwXzE2NDg2NTE4NzU6MTY0ODY1NTQ3NV9WNA!


> TaskManager hung in cleanupAllocationBaseDirs not exit.
> ---
>
> Key: FLINK-26932
> URL: https://issues.apache.org/jira/browse/FLINK-26932
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Reporter: huweihua
>Priority: Major
> Attachments: origin_img_v2_bb063beb-2f44-40fe-b1d2-4cc8dc87585g.png
>
>
> The disk used by the TaskManager hit a fatal error, and the TaskManager then 
> hung in cleanupAllocationBaseDirs, blocking the main thread.
>  
> As a result, this TaskManager would not respond to 
> cancelTask/disconnectResourceManager requests.
>  
> At the same time, the JobMaster already considered this TaskManager lost and 
> scheduled its tasks to other TaskManagers.
>  
> This may leave some unexpected tasks running.
>  
> According to the TaskManager log, the TM had already lost its connection to 
> the ResourceManager and kept trying to re-register. The RegistrationTimeout 
> could not take effect because the TaskManager's main thread was hung.
>  
> I think there are two options to handle it.
> Option 1: Add a timeout to 
> TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but I am afraid 
> some other methods could block the main thread too.
> Option 2: Move the registrationTimeout to another thread; we would need to 
> deal with the concurrency problem.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-26861) Log cannot be printed completely when use System.exit(1)

2022-03-30 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515021#comment-17515021
 ] 

huweihua commented on FLINK-26861:
--

Hi [~chenzihao], could you provide the log4j2 configuration? Is asynchronous 
mode enabled?

> Log cannot be printed completely when use System.exit(1)
> 
>
> Key: FLINK-26861
> URL: https://issues.apache.org/jira/browse/FLINK-26861
> Project: Flink
>  Issue Type: Bug
>  Components: Client / Job Submission, Deployment / YARN
>Affects Versions: 1.12.2, 1.14.0
>Reporter: chenzihao
>Priority: Major
> Attachments: image-2022-03-25-16-35-27-550.png, 
> image-2022-03-25-16-35-39-531.png, image-2022-03-25-16-35-46-528.png
>
>
> When I submit a Flink job in YARN application mode and an expected exception 
> occurs, I cannot find it in the jobmanager log; I only see exit code 1 in the 
> YARN log.
> !image-2022-03-25-16-35-27-550.png!
> h2. Relative code
> {code:java}
> // YarnApplicationClusterEntryPoint#main
> try {
>     program = getPackagedProgram(configuration);
> } catch (Exception e) {
>     LOG.error("Could not create application program.", e);
>     Thread.sleep(10); // add this for test
>     System.exit(1);
> }
> {code}
> h2. Log info
>  * Without the added Thread.sleep, the ERROR log is missing.
> !image-2022-03-25-16-35-39-531.png!
>  * With the added Thread.sleep, the ERROR log appears.
> !image-2022-03-25-16-35-46-528.png!
> h2. Reason
> System.exit() terminates the currently running Java Virtual Machine too fast 
> to print the ERROR log.
> h2. Improvement
> I think we can print the ERROR log to jobmanager.err before calling 
> System.exit(1) so that we can still get the exception info.
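
As a hedged illustration of the improvement idea above (not the actual Flink 
patch; it assumes log4j2 is on the classpath), one could write the error to 
stderr and stop the logging framework, which flushes asynchronous appenders, 
before exiting:

{code:java}
// Hedged sketch only: illustrates flushing logs before System.exit, not the real fix.
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

class ExitWithFlushedLogs {
    private static final Logger LOG = LogManager.getLogger(ExitWithFlushedLogs.class);

    static void failAndExit(Exception e) {
        LOG.error("Could not create application program.", e);
        // Also write to stderr (which ends up in jobmanager.err on YARN) in case
        // an asynchronous appender has not flushed the message yet.
        e.printStackTrace(System.err);
        // Stopping log4j2 flushes asynchronous appenders before the JVM exits.
        LogManager.shutdown();
        System.exit(1);
    }

    public static void main(String[] args) {
        failAndExit(new IllegalStateException("could not create application program"));
    }
}
{code}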



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26932) TaskManager hung in cleanupAllocationBaseDirs not exit.

2022-03-30 Thread huweihua (Jira)
huweihua created FLINK-26932:


 Summary: TaskManager hung in cleanupAllocationBaseDirs not exit.
 Key: FLINK-26932
 URL: https://issues.apache.org/jira/browse/FLINK-26932
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Reporter: huweihua


The disk used by the TaskManager hit a fatal error, and the TaskManager then 
hung in cleanupAllocationBaseDirs, blocking the main thread.
 
As a result, this TaskManager would not respond to 
cancelTask/disconnectResourceManager requests.
 
At the same time, the JobMaster already considered this TaskManager lost and 
scheduled its tasks to other TaskManagers.
 
This may leave some unexpected tasks running.
 
According to the TaskManager log, the TM had already lost its connection to the 
ResourceManager and kept trying to re-register. The RegistrationTimeout could 
not take effect because the TaskManager's main thread was hung.
 
I think there are two options to handle it. # 
Option 1: Add a timeout to 
TaskExecutorLocalStateStoreManager.cleanupAllocationBaseDirs, but I am afraid 
some other methods could block the main thread too.

 # 
Option 2: Move the registrationTimeout to another thread; we would need to deal 
with the concurrency problem.
!https://bytedance.feishu.cn/space/api/box/stream/download/asynccode/?code=ZmVkMDNhZjZkNzA2NTNkOGZjNjJmNGM0ZGYxNGY2NDFfTnV4SUd0RzQ3WnVJRWpWdVBJNFFncEMzTHdZZ3U0WDFfVG9rZW46Ym94Y25zMG1GdWM5M2hKNzJEcXhyN0FmRFgxXzE2NDg2NTE4Njg6MTY0ODY1NTQ2OF9WNA!
 
!https://bytedance.feishu.cn/space/api/box/stream/download/asynccode/?code=MDhiZDU0NDg0NzU3ZjgwYmIxOTU0YzQyMTIxMGE4YzJfQkhLMVI2bEZGUnhpR210c1BDelZDRUI3YjJDY2Q1T3NfVG9rZW46Ym94Y250aG1UTjBXTmI2TTFqYlV1eG9MTnMwXzE2NDg2NTE4NzU6MTY0ODY1NTQ3NV9WNA!
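
As a rough sketch of Option 1 above (placeholder names only, not the actual 
Flink code or the eventual fix), the cleanup could run on a separate executor 
and be waited on with a timeout so a faulty disk cannot block the main thread 
indefinitely:

{code:java}
// Hedged sketch of Option 1: bound the cleanup with a timeout. Placeholder names only.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class CleanupWithTimeout {
    private static final ExecutorService IO_EXECUTOR = Executors.newSingleThreadExecutor();

    static void cleanupAllocationBaseDirs() {
        // placeholder for the real cleanup, which may hang on a faulty disk
    }

    static void cleanupWithTimeout(long timeoutMillis) {
        CompletableFuture<Void> cleanup =
                CompletableFuture.runAsync(CleanupWithTimeout::cleanupAllocationBaseDirs, IO_EXECUTOR);
        try {
            cleanup.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            cleanup.cancel(true);   // stop waiting so the caller stays responsive
        } catch (Exception e) {
            // cleanup failed; log and continue
        }
    }

    public static void main(String[] args) {
        cleanupWithTimeout(10_000);
        IO_EXECUTOR.shutdown();
    }
}
{code}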



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-18229) Pending worker requests should be properly cleared

2022-03-30 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17514605#comment-17514605
 ] 

huweihua commented on FLINK-18229:
--

Sorry for the late reply, I was stuck with something else.
For this change, I wrote a small design document; could you help me review it?

https://docs.google.com/document/d/1lcmf3MKmcmf9tsPc1whaZHMYKurGoGtqOXEep9ngP2k/edit?usp=sharing

[~xtsong] 

> Pending worker requests should be properly cleared
> --
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
> Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Assignee: huweihua
>Priority: Major
> Fix For: 1.15.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
> Flink's resource requirements, there will be pending pod/container requests on 
> Kubernetes/Yarn. These pending resource requests are never cleared until 
> either fulfilled or the Flink cluster is shut down.
> However, sometimes Flink no longer needs the pending resources. E.g., the 
> slot request is fulfilled by another slot that becomes available, or the job 
> failed due to a slot request timeout (in a session cluster). In such cases, 
> Flink does not remove the resource request until the resource is allocated, 
> then it discovers that it no longer needs the allocated resource and releases 
> it. This affects the underlying Kubernetes/Yarn cluster, especially when the 
> cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as early as 
> possible once it discovers that some of the pending workers are no longer 
> needed.
> There are several approaches that could potentially achieve this.
>  # We can always check whether there is a pending worker that can be canceled 
> when a {{PendingTaskManagerSlot}} is unassigned.
>  # We can have a separate timeout for requesting a new worker. If the resource 
> cannot be allocated within the given time since it was requested, we should 
> cancel that resource request and report a resource allocation failure.
>  # We can share the same timeout for starting a new worker (proposed in 
> FLINK-13554). This is similar to 2), but it requires the worker to be 
> registered, rather than allocated, before the timeout.
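
As a hedged sketch of approach 2 in the list above (placeholder names; not the 
Flink ResourceManager API), a per-request timeout could cancel a pending 
pod/container request that is not fulfilled in time:

{code:java}
// Hedged sketch of approach 2: cancel a pending worker request after a timeout.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class PendingWorkerTracker {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final Map<String, ScheduledFuture<?>> pendingTimeouts = new ConcurrentHashMap<>();

    void onWorkerRequested(String requestId, long timeoutMillis) {
        ScheduledFuture<?> timeout = timer.schedule(
                () -> cancelWorkerRequest(requestId), timeoutMillis, TimeUnit.MILLISECONDS);
        pendingTimeouts.put(requestId, timeout);
    }

    void onWorkerAllocated(String requestId) {
        ScheduledFuture<?> timeout = pendingTimeouts.remove(requestId);
        if (timeout != null) {
            timeout.cancel(false);   // fulfilled in time, drop the timeout
        }
    }

    void cancelWorkerRequest(String requestId) {
        pendingTimeouts.remove(requestId);
        // placeholder: ask Kubernetes/YARN to cancel the pending pod/container request
    }

    public static void main(String[] args) throws InterruptedException {
        PendingWorkerTracker tracker = new PendingWorkerTracker();
        tracker.onWorkerRequested("worker-1", 500);
        Thread.sleep(1_000);   // the timeout fires and the pending request is canceled
        tracker.timer.shutdown();
    }
}
{code}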



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-26395) The description of RAND_INTEGER is wrong in SQL function documents

2022-03-15 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507304#comment-17507304
 ] 

huweihua commented on FLINK-26395:
--

Hi [~libenchao], I have submitted the PR. Please help review it when you have time.

> The description of RAND_INTEGER is wrong in SQL function documents
> --
>
> Key: FLINK-26395
> URL: https://issues.apache.org/jira/browse/FLINK-26395
> Project: Flink
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.15.0, 1.13.5, 1.14.3
>Reporter: huweihua
>Assignee: huweihua
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2022-02-28-19-57-18-390.png
>
>
> RAND_INTEGER returns an integer value, but the SQL function documentation 
> shows it returning a double value.
> !image-2022-02-28-19-57-18-390.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26395) The description of RAND_INTEGER is wrong in SQL function documents

2022-02-28 Thread huweihua (Jira)
huweihua created FLINK-26395:


 Summary: The description of RAND_INTEGER is wrong in SQL function 
documents
 Key: FLINK-26395
 URL: https://issues.apache.org/jira/browse/FLINK-26395
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.14.3, 1.13.5, 1.15.0
Reporter: huweihua
 Attachments: image-2022-02-28-19-57-18-390.png

RAND_INTEGER returns an integer value, but the SQL function documentation shows 
it returning a double value.

!image-2022-02-28-19-57-18-390.png!
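
A minimal way to check the described behavior (assuming a recent Flink Table 
API on the classpath; the class name below is made up) is to run the function 
and inspect the result type, which comes back as INT rather than DOUBLE:

{code:java}
// Hedged check of the behavior described above; not part of the documentation fix itself.
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class RandIntegerTypeCheck {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
        // The printed result shows column "r" typed as INT, not DOUBLE.
        tEnv.executeSql("SELECT RAND_INTEGER(10) AS r").print();
    }
}
{code}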



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-18229) Pending worker requests should be properly cleared

2022-02-24 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-18229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17497929#comment-17497929
 ] 

huweihua commented on FLINK-18229:
--

Hi [~xtsong], I would like to work on this issue; could you assign it to me?

> Pending worker requests should be properly cleared
> --
>
> Key: FLINK-18229
> URL: https://issues.apache.org/jira/browse/FLINK-18229
> Project: Flink
>  Issue Type: Sub-task
>  Components: Deployment / Kubernetes, Deployment / YARN, Runtime / 
> Coordination
>Affects Versions: 1.9.3, 1.10.1, 1.11.0
>Reporter: Xintong Song
>Priority: Major
> Fix For: 1.15.0
>
>
> Currently, if Kubernetes/Yarn does not have enough resources to fulfill 
> Flink's resource requirements, there will be pending pod/container requests on 
> Kubernetes/Yarn. These pending resource requests are never cleared until 
> either fulfilled or the Flink cluster is shut down.
> However, sometimes Flink no longer needs the pending resources. E.g., the 
> slot request is fulfilled by another slot that becomes available, or the job 
> failed due to a slot request timeout (in a session cluster). In such cases, 
> Flink does not remove the resource request until the resource is allocated, 
> then it discovers that it no longer needs the allocated resource and releases 
> it. This affects the underlying Kubernetes/Yarn cluster, especially when the 
> cluster is under heavy workload.
> It would be good for Flink to cancel pod/container requests as early as 
> possible once it discovers that some of the pending workers are no longer 
> needed.
> There are several approaches that could potentially achieve this.
>  # We can always check whether there is a pending worker that can be canceled 
> when a {{PendingTaskManagerSlot}} is unassigned.
>  # We can have a separate timeout for requesting a new worker. If the resource 
> cannot be allocated within the given time since it was requested, we should 
> cancel that resource request and report a resource allocation failure.
>  # We can share the same timeout for starting a new worker (proposed in 
> FLINK-13554). This is similar to 2), but it requires the worker to be 
> registered, rather than allocated, before the timeout.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-17341) freeSlot in TaskExecutor.closeJobManagerConnection cause ConcurrentModificationException

2020-10-08 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210588#comment-17210588
 ] 

huweihua commented on FLINK-17341:
--

I think we can put the slots that need to be freed into a list, and then free 
them outside the loop (see the sketch below). [~trohrmann]
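
A hedged fragment of that idea, adapted from the snippet quoted in the issue 
description below (same surrounding names; this is a sketch, not the patch that 
was eventually merged):

{code:java}
// Sketch: defer freeing until after the iteration so slotsPerJob is not modified
// while it is being iterated.
final List<AllocationID> slotsToFree = new ArrayList<>();
final FlinkException freeingCause = new FlinkException("Slot could not be marked inactive.");
final Iterator<AllocationID> activeSlots = taskSlotTable.getActiveSlots(jobId);
while (activeSlots.hasNext()) {
    final AllocationID activeSlot = activeSlots.next();
    try {
        if (!taskSlotTable.markSlotInactive(activeSlot, taskManagerConfiguration.getTimeout())) {
            slotsToFree.add(activeSlot);   // remember it instead of freeing immediately
        }
    } catch (SlotNotFoundException e) {
        log.debug("Could not mark the slot {} inactive.", jobId, e);
    }
}
for (AllocationID slot : slotsToFree) {
    freeSlotInternal(slot, freeingCause);  // safe: the iteration above has finished
}
{code}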

> freeSlot in TaskExecutor.closeJobManagerConnection cause 
> ConcurrentModificationException
> 
>
> Key: FLINK-17341
> URL: https://issues.apache.org/jira/browse/FLINK-17341
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0, 1.10.2, 1.11.2
>Reporter: huweihua
>Assignee: Matthias
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> The TaskExecutor may free slots in closeJobManagerConnection. freeSlot 
> modifies TaskSlotTable.slotsPerJob, and modifying it while iterating causes a 
> ConcurrentModificationException.
> {code:java}
> Iterator<AllocationID> activeSlots = taskSlotTable.getActiveSlots(jobId);
> final FlinkException freeingCause = new FlinkException("Slot could not be marked inactive.");
> while (activeSlots.hasNext()) {
>     AllocationID activeSlot = activeSlots.next();
>     try {
>         if (!taskSlotTable.markSlotInactive(activeSlot, taskManagerConfiguration.getTimeout())) {
>             freeSlotInternal(activeSlot, freeingCause);
>         }
>     } catch (SlotNotFoundException e) {
>         log.debug("Could not mark the slot {} inactive.", jobId, e);
>     }
> }
> {code}
>  error log:
> {code:java}
> 2020-04-21 23:37:11,363 ERROR org.apache.flink.runtime.rpc.akka.AkkaRpcActor  
>   - Caught exception while executing runnable in main thread.
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
> at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
> at 
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$TaskSlotIterator.hasNext(TaskSlotTable.java:698)
> at 
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$AllocationIDIterator.hasNext(TaskSlotTable.java:652)
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1314)
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1300(TaskExecutor.java:149)
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1726)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17341) freeSlot in TaskExecutor.closeJobManagerConnection cause ConcurrentModificationException

2020-09-29 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17203758#comment-17203758
 ] 

huweihua commented on FLINK-17341:
--

[~trohrmann] I would like to fix this issue; could you assign it to me?

> freeSlot in TaskExecutor.closeJobManagerConnection cause 
> ConcurrentModificationException
> 
>
> Key: FLINK-17341
> URL: https://issues.apache.org/jira/browse/FLINK-17341
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.9.0, 1.10.2, 1.11.2
>Reporter: huweihua
>Priority: Major
> Fix For: 1.12.0, 1.10.3, 1.11.3
>
>
> The TaskExecutor may free slots in closeJobManagerConnection. freeSlot 
> modifies TaskSlotTable.slotsPerJob, and modifying it while iterating causes a 
> ConcurrentModificationException.
> {code:java}
> Iterator<AllocationID> activeSlots = taskSlotTable.getActiveSlots(jobId);
> final FlinkException freeingCause = new FlinkException("Slot could not be marked inactive.");
> while (activeSlots.hasNext()) {
>     AllocationID activeSlot = activeSlots.next();
>     try {
>         if (!taskSlotTable.markSlotInactive(activeSlot, taskManagerConfiguration.getTimeout())) {
>             freeSlotInternal(activeSlot, freeingCause);
>         }
>     } catch (SlotNotFoundException e) {
>         log.debug("Could not mark the slot {} inactive.", jobId, e);
>     }
> }
> {code}
>  error log:
> {code:java}
> 2020-04-21 23:37:11,363 ERROR org.apache.flink.runtime.rpc.akka.AkkaRpcActor  
>   - Caught exception while executing runnable in main thread.
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
> at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
> at 
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$TaskSlotIterator.hasNext(TaskSlotTable.java:698)
> at 
> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$AllocationIDIterator.hasNext(TaskSlotTable.java:652)
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1314)
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1300(TaskExecutor.java:149)
> at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1726)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
> at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (FLINK-15071) YARN vcore capacity check can not pass when use large slotPerTaskManager

2020-06-04 Thread huweihua (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huweihua closed FLINK-15071.

Release Note: not exist
  Resolution: Fixed

> YARN vcore capacity check can not pass when use large slotPerTaskManager
> 
>
> Key: FLINK-15071
> URL: https://issues.apache.org/jira/browse/FLINK-15071
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.9.0
>Reporter: huweihua
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The YARN vcore capacity check in YarnClusterDescriptor.isReadyForDeployment 
> cannot pass if we configure a large slotsPerTaskManager (such as 96). The 
> dynamic property yarn.containers.vcores does not take effect.
> This is because the dynamic properties are set after the isReadyForDeployment 
> check.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-17341) freeSlot in TaskExecutor.closeJobManagerConnection cause ConcurrentModificationException

2020-04-23 Thread huweihua (Jira)
huweihua created FLINK-17341:


 Summary: freeSlot in TaskExecutor.closeJobManagerConnection cause 
ConcurrentModificationException
 Key: FLINK-17341
 URL: https://issues.apache.org/jira/browse/FLINK-17341
 Project: Flink
  Issue Type: Bug
Reporter: huweihua


The TaskExecutor may free slots in closeJobManagerConnection. freeSlot modifies 
the TaskSlotTable.slotsPerJob, and modifying it while iterating causes a 
ConcurrentModificationException.
{code:java}
Iterator<AllocationID> activeSlots = taskSlotTable.getActiveSlots(jobId);

final FlinkException freeingCause = new FlinkException("Slot could not be marked inactive.");

while (activeSlots.hasNext()) {
    AllocationID activeSlot = activeSlots.next();

    try {
        if (!taskSlotTable.markSlotInactive(activeSlot, taskManagerConfiguration.getTimeout())) {
            freeSlotInternal(activeSlot, freeingCause);
        }
    } catch (SlotNotFoundException e) {
        log.debug("Could not mark the slot {} inactive.", jobId, e);
    }
}
{code}
 error log:
{code:java}
2020-04-21 23:37:11,363 ERROR org.apache.flink.runtime.rpc.akka.AkkaRpcActor
- Caught exception while executing runnable in main thread.
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
at 
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$TaskSlotIterator.hasNext(TaskSlotTable.java:698)
at 
org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable$AllocationIDIterator.hasNext(TaskSlotTable.java:652)
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1314)
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1300(TaskExecutor.java:149)
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$1(TaskExecutor.java:1726)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-10052) Tolerate temporarily suspended ZooKeeper connections

2020-04-14 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082938#comment-17082938
 ] 

huweihua commented on FLINK-10052:
--

Hi [~Tison],

Is there any progress on the Curator upgrade?

> Tolerate temporarily suspended ZooKeeper connections
> 
>
> Key: FLINK-10052
> URL: https://issues.apache.org/jira/browse/FLINK-10052
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Affects Versions: 1.4.2, 1.5.2, 1.6.0, 1.8.1
>Reporter: Till Rohrmann
>Assignee: Zili Chen
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This issue results from FLINK-10011 which uncovered a problem with Flink's HA 
> recovery and proposed the following solution to harden Flink:
> The {{ZooKeeperLeaderElectionService}} uses the {{LeaderLatch}} Curator 
> recipe for leader election. The leader latch revokes leadership in case of a 
> suspended ZooKeeper connection. This can be premature in case the system can 
> reconnect to ZooKeeper before its session expires. The effect of the lost 
> leadership is that all jobs will be canceled and directly restarted after 
> regaining the leadership.
> Instead of directly revoking the leadership upon a SUSPENDED ZooKeeper 
> connection, it would be better to wait until the ZooKeeper connection is 
> LOST. That way we would allow the system to reconnect and not lose the 
> leadership. This could be achieved by using Curator's {{LeaderSelector}} 
> instead of the {{LeaderLatch}}.
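
As a hedged, generic illustration of the proposal above using the plain Curator 
API (not Flink's shaded Curator or its HA services code), a connection state 
listener can distinguish SUSPENDED from LOST and only revoke leadership on the 
latter:

{code:java}
// Hedged sketch only: tolerate SUSPENDED, react to LOST. Not the Flink implementation.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

class TolerateSuspendedConnection {
    public static void main(String[] args) {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        ConnectionStateListener listener = (c, newState) -> {
            if (newState == ConnectionState.SUSPENDED) {
                // temporary disconnect: keep leadership and wait for a reconnect
            } else if (newState == ConnectionState.LOST) {
                // session expired: only now revoke leadership and trigger re-election
            }
        };
        client.getConnectionStateListenable().addListener(listener);
        client.start();
    }
}
{code}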



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-15071) YARN vcore capacity check can not pass when use large slotPerTaskManager

2020-03-12 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058335#comment-17058335
 ] 

huweihua edited comment on FLINK-15071 at 3/13/20, 1:37 AM:


[Chesnay 
Schepler|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=chesnay], 
thanks for your reply. The check always ignores the vcores configured by 
"yarn.containers.vcores".


was (Author: huwh):
Chesnay Schepler, thanks for your reply. The check always ignores the vcores 
configured by "yarn.containers.vcores".

> YARN vcore capacity check can not pass when use large slotPerTaskManager
> 
>
> Key: FLINK-15071
> URL: https://issues.apache.org/jira/browse/FLINK-15071
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.9.0
>Reporter: huweihua
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The YARN vcore capacity check in YarnClusterDescriptor.isReadyForDeployment 
> cannot pass if we configure a large slotsPerTaskManager (such as 96). The 
> dynamic property yarn.containers.vcores does not take effect.
> This is because the dynamic properties are set after the isReadyForDeployment 
> check.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-15071) YARN vcore capacity check can not pass when use large slotPerTaskManager

2020-03-12 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058335#comment-17058335
 ] 

huweihua commented on FLINK-15071:
--

Chesnay Schepler, thanks for your reply. The check always ignores the vcores 
configured by "yarn.containers.vcores".

> YARN vcore capacity check can not pass when use large slotPerTaskManager
> 
>
> Key: FLINK-15071
> URL: https://issues.apache.org/jira/browse/FLINK-15071
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.9.0
>Reporter: huweihua
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The YARN vcore capacity check in YarnClusterDescriptor.isReadyForDeployment 
> cannot pass if we configure a large slotsPerTaskManager (such as 96). The 
> dynamic property yarn.containers.vcores does not take effect.
> This is because the dynamic properties are set after the isReadyForDeployment 
> check.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16069) Creation of TaskDeploymentDescriptor can block main thread for long time

2020-02-18 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039191#comment-17039191
 ] 

huweihua commented on FLINK-16069:
--

Hi [~trohrmann], thanks for your time. Yes, the iteration over the input edges 
is what takes so long.

I didn't think too much about race conditions. I tried creating the 
TaskDeploymentDescriptorFactory in the main thread and then putting 
createDeploymentDescriptor into a future; this reduces the time taken in the 
main thread to 2s.

Glad to receive any suggestions.
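
A hedged sketch of that approach (all class and method names below are 
placeholders, not the actual Flink classes or the change being discussed): 
build the expensive descriptor on an I/O executor and hop back to the 
main-thread executor only for the submission step:

{code:java}
// Hedged sketch only: offload expensive descriptor creation, submit on the main thread.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncDeploymentSketch {
    static class DeploymentDescriptor {}

    static DeploymentDescriptor createDeploymentDescriptor() {
        // stands in for the expensive part: iterating over all input edges, etc.
        return new DeploymentDescriptor();
    }

    static void submit(DeploymentDescriptor descriptor) {
        // stands in for handing the descriptor to the TaskManager via RPC
        System.out.println("submitted " + descriptor);
    }

    public static void main(String[] args) {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        CompletableFuture
                .supplyAsync(AsyncDeploymentSketch::createDeploymentDescriptor, ioExecutor)
                // Runnable::run acts as a direct executor here; in the JobMaster this
                // would be the main-thread executor instead.
                .thenAcceptAsync(AsyncDeploymentSketch::submit, Runnable::run)
                .join();
        ioExecutor.shutdown();
    }
}
{code}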

 

> Creation of TaskDeploymentDescriptor can block main thread for long time
> 
>
> Key: FLINK-16069
> URL: https://issues.apache.org/jira/browse/FLINK-16069
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: huweihua
>Priority: Major
>
> Deploying tasks takes a long time when we submit a high-parallelism job, and 
> Execution#deploy runs in the main thread, so it blocks the JobMaster from 
> processing other Akka messages, such as heartbeats. The creation of the 
> TaskDeploymentDescriptor takes most of the time; we can put the creation in a 
> future.
> For example, for a job [source(8000)->sink(8000)], moving all 16000 tasks from 
> SCHEDULED to DEPLOYING took more than 1 minute. This caused the TaskManager 
> heartbeats to time out and the job never succeeded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (FLINK-16069) Creation of TaskDeploymentDescriptor can block main thread for long time

2020-02-18 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039191#comment-17039191
 ] 

huweihua edited comment on FLINK-16069 at 2/18/20 4:07 PM:
---

Hi [~trohrmann], thanks for your time. Yes, the iteration over the input edges 
is what takes so long.

I didn't think too much about race conditions. I tried creating the 
TaskDeploymentDescriptorFactory in the main thread and then putting 
createDeploymentDescriptor into a future; this reduces the time taken in the 
main thread to 2s.

Glad to receive any suggestions.

 


was (Author: huwh):
Hi [~trohrmann], thanks for your time. Yes, the iteration over the input edges 
is what takes so long.

I didn't think too much about race conditions. I tried creating the 
TaskDeploymentDescriptorFactory in the main thread and then putting 
createDeploymentDescriptor into a future; this reduces the time taken in the 
main thread to 2s.

Glad to receive any suggestions.

 

> Creation of TaskDeploymentDescriptor can block main thread for long time
> 
>
> Key: FLINK-16069
> URL: https://issues.apache.org/jira/browse/FLINK-16069
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Coordination
>Reporter: huweihua
>Priority: Major
>
> Deploying tasks takes a long time when we submit a high-parallelism job, and 
> Execution#deploy runs in the main thread, so it blocks the JobMaster from 
> processing other Akka messages, such as heartbeats. The creation of the 
> TaskDeploymentDescriptor takes most of the time; we can put the creation in a 
> future.
> For example, for a job [source(8000)->sink(8000)], moving all 16000 tasks from 
> SCHEDULED to DEPLOYING took more than 1 minute. This caused the TaskManager 
> heartbeats to time out and the job never succeeded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-12122) Spread out tasks evenly across all available registered TaskManagers

2020-02-18 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038886#comment-17038886
 ] 

huweihua commented on FLINK-12122:
--

[~trohrmann] I think a minimum number of TaskExecutors would be helpful. Do we 
have an issue for this feature?

> Spread out tasks evenly across all available registered TaskManagers
> 
>
> Key: FLINK-12122
> URL: https://issues.apache.org/jira/browse/FLINK-12122
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.6.4, 1.7.2, 1.8.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.2, 1.10.0
>
> Attachments: image-2019-05-21-12-28-29-538.png, 
> image-2019-05-21-13-02-50-251.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> With Flip-6, we changed the default behaviour of how slots are assigned to 
> {{TaskManagers}}. Instead of evenly spreading them out over all registered 
> {{TaskManagers}}, we randomly pick slots from {{TaskManagers}} with a 
> tendency to first fill up a TM before using another one. This is a regression 
> wrt the pre-Flip-6 code.
> I suggest changing the behaviour so that we try to evenly distribute slots 
> across all available {{TaskManagers}} by considering how many of their slots 
> are already allocated.
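
A hedged sketch of the spread-out idea (plain placeholder types; not the actual 
Flink SlotManager or slot selection strategy): among TaskManagers that still 
have a free slot, pick the one with the lowest fraction of allocated slots:

{code:java}
// Hedged sketch only: least-utilized TaskManager selection with placeholder types.
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class SpreadOutSelectionSketch {
    static class TaskManagerSlots {
        final String id;
        final int allocatedSlots;
        final int totalSlots;

        TaskManagerSlots(String id, int allocatedSlots, int totalSlots) {
            this.id = id;
            this.allocatedSlots = allocatedSlots;
            this.totalSlots = totalSlots;
        }

        double utilization() {
            return totalSlots == 0 ? 1.0 : (double) allocatedSlots / totalSlots;
        }
    }

    static Optional<TaskManagerSlots> pickLeastUtilized(List<TaskManagerSlots> taskManagers) {
        return taskManagers.stream()
                .filter(tm -> tm.allocatedSlots < tm.totalSlots)   // must still have a free slot
                .min(Comparator.comparingDouble(TaskManagerSlots::utilization));
    }
}
{code}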



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (FLINK-16069) Create TaskDeploymentDescriptor in future.

2020-02-15 Thread huweihua (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huweihua updated FLINK-16069:
-
Description: 
The deploy of tasks will take long time when we submit a high parallelism job. 
And Execution#deploy run in mainThread, so it will block JobMaster process 
other akka messages, such as Heartbeat. The creation of 
TaskDeploymentDescriptor take most of time. We can put the creation in future.

For example, A job [source(8000)->sink(8000)], the total 16000 tasks from 
SCHEDULED to DEPLOYING took more than 1mins. This caused the heartbeat of 
TaskManager timeout and job never success.

  was:
Deploying tasks takes a long time when we submit a high-parallelism job, and 
Execution#deploy runs in the main thread, so it blocks the JobMaster from 
processing other Akka messages, such as heartbeats. The creation of the 
TaskDeploymentDescriptor takes most of the time; we can put the creation in a 
future.

For example, for a job [source(8000)->sink(8000)], moving all 16000 tasks from 
SCHEDULED to DEPLOYING took more than 1 minute. This caused the TaskManager 
heartbeats to time out and the job never succeeded.


> Create TaskDeploymentDescriptor in future.
> --
>
> Key: FLINK-16069
> URL: https://issues.apache.org/jira/browse/FLINK-16069
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Reporter: huweihua
>Priority: Major
>
> Deploying tasks takes a long time when we submit a high-parallelism job, and 
> Execution#deploy runs in the main thread, so it blocks the JobMaster from 
> processing other Akka messages, such as heartbeats. The creation of the 
> TaskDeploymentDescriptor takes most of the time; we can put the creation in a 
> future.
> For example, for a job [source(8000)->sink(8000)], moving all 16000 tasks from 
> SCHEDULED to DEPLOYING took more than 1 minute. This caused the TaskManager 
> heartbeats to time out and the job never succeeded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-16069) Create TaskDeploymentDescriptor in future.

2020-02-14 Thread huweihua (Jira)
huweihua created FLINK-16069:


 Summary: Create TaskDeploymentDescriptor in future.
 Key: FLINK-16069
 URL: https://issues.apache.org/jira/browse/FLINK-16069
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Task
Reporter: huweihua


Deploying tasks takes a long time when we submit a high-parallelism job, and 
Execution#deploy runs in the main thread, so it blocks the JobMaster from 
processing other Akka messages, such as heartbeats. The creation of the 
TaskDeploymentDescriptor takes most of the time; we can put the creation in a 
future.

For example, for a job [source(8000)->sink(8000)], moving all 16000 tasks from 
SCHEDULED to DEPLOYING took more than 1 minute. This caused the TaskManager 
heartbeats to time out and the job never succeeded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-12122) Spread out tasks evenly across all available registered TaskManagers

2020-01-15 Thread huweihua (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17015832#comment-17015832
 ] 

huweihua commented on FLINK-12122:
--

[~trohrmann] I have the same issue as [~liuyufei].

We run Flink in per-job mode. We have thousands of jobs that need to be 
upgraded from Flink 1.5 to Flink 1.9, and the change of scheduling strategy 
causes a load-balancing issue. This has blocked our upgrade plan.
In addition to the load-balancing issue, we also encountered other problems 
caused by the Flink 1.9 scheduling strategy:
 # Network bandwidth: tasks of the same type are scheduled to one TaskManager, 
causing too much network traffic on that machine.
 # Some jobs need to sink to a local agent; after centralized scheduling, the 
insufficient processing capacity of a single machine causes a backlog of 
consumption.

I think a spread-out (decentralized) scheduling strategy is reasonable.

> Spread out tasks evenly across all available registered TaskManagers
> 
>
> Key: FLINK-12122
> URL: https://issues.apache.org/jira/browse/FLINK-12122
> Project: Flink
>  Issue Type: Sub-task
>  Components: Runtime / Coordination
>Affects Versions: 1.6.4, 1.7.2, 1.8.0
>Reporter: Till Rohrmann
>Assignee: Till Rohrmann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.9.2, 1.10.0
>
> Attachments: image-2019-05-21-12-28-29-538.png, 
> image-2019-05-21-13-02-50-251.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> With Flip-6, we changed the default behaviour of how slots are assigned to 
> {{TaskManagers}}. Instead of evenly spreading them out over all registered 
> {{TaskManagers}}, we randomly pick slots from {{TaskManagers}} with a 
> tendency to first fill up a TM before using another one. This is a regression 
> wrt the pre-Flip-6 code.
> I suggest changing the behaviour so that we try to evenly distribute slots 
> across all available {{TaskManagers}} by considering how many of their slots 
> are already allocated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-15071) YARN vcore capacity check can not pass when use large slotPerTaskManager

2019-12-05 Thread huweihua (Jira)
huweihua created FLINK-15071:


 Summary: YARN vcore capacity check can not pass when use large 
slotPerTaskManager
 Key: FLINK-15071
 URL: https://issues.apache.org/jira/browse/FLINK-15071
 Project: Flink
  Issue Type: Bug
  Components: Deployment / YARN
Affects Versions: 1.9.0
Reporter: huweihua


The YARN vcore capacity check in YarnClusterDescriptor.isReadyForDeployment 
cannot pass if we configure a large slotsPerTaskManager (such as 96). The 
dynamic property yarn.containers.vcores does not take effect.

This is because the dynamic properties are set after the isReadyForDeployment 
check.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)