[jira] [Comment Edited] (FLINK-10868) Flink's JobCluster ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement

Zhenqiu Huang (JIRA) Sat, 24 Nov 2018 09:59:26 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697677#comment-16697677
 ]


Zhenqiu Huang edited comment on FLINK-10868 at 11/24/18 5:48 PM:
-----------------------------------------------------------------

[~suez1224] [~till.rohrmann]

I agree with Shuyi's proposal, since maximum-failed-containers is more of a 
job-level configuration than a session-cluster-level one. We could start with 
a simple fix for the per-job cluster to restore feature parity with the former 
release. 

1) I will add a boolean parameter to the createResourceManager function in 
ResourceManagerFactory to distinguish whether it runs for a per-job cluster or 
a session cluster. I will also pass 
LeaderGatewayRetriever<DispatcherGateway> dispatcherGatewayRetriever as one of 
the parameters of that function.

2) If it is a per-job cluster, once the threshold is hit, shut down the 
cluster (shutdownCluster) through the DispatcherGateway. 
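
A minimal sketch of the counting logic behind the two steps above, not the 
actual patch. FailedContainerLimiter and ClusterShutdown are hypothetical names 
that do not exist in Flink; ClusterShutdown only stands in for the shutdown 
call that would go through the DispatcherGateway. Whether the limit triggers at 
"equal to" or "greater than" the configured value is also an assumption here.

```java
/** Hypothetical stand-in for the shutdown call made through the DispatcherGateway. */
interface ClusterShutdown {
    void shutDownCluster();
}

/** Hypothetical helper (not a Flink class): applies the failed-container limit. */
final class FailedContainerLimiter {
    private final int maxFailedContainers;  // value of yarn.maximum-failed-containers
    private final boolean perJobCluster;    // the boolean flag proposed in step 1
    private final ClusterShutdown shutdown; // stands in for the dispatcher gateway
    private int failedContainers;
    private boolean shutdownRequested;

    FailedContainerLimiter(int maxFailedContainers, boolean perJobCluster,
                           ClusterShutdown shutdown) {
        this.maxFailedContainers = maxFailedContainers;
        this.perJobCluster = perJobCluster;
        this.shutdown = shutdown;
    }

    /**
     * Records one failed container.
     *
     * @return true if a replacement container may still be requested,
     *         false if the cluster shutdown has been requested instead
     */
    boolean onContainerFailed() {
        failedContainers++;
        // Session clusters keep requesting containers; the limit is per job only.
        if (!perJobCluster) {
            return true;
        }
        // Assumption: the limit counts as hit once the count reaches the value.
        if (failedContainers >= maxFailedContainers) {
            if (!shutdownRequested) {
                shutdownRequested = true; // request the shutdown only once
                shutdown.shutDownCluster();
            }
            return false;
        }
        return true;
    }

    int getFailedContainers() {
        return failedContainers;
    }
}
```

With a limit of 2 in per-job mode, the first failure still allows a new 
container request and the second one requests the shutdown. In session mode the 
helper never requests a shutdown, which matches the idea of fixing only the 
per-job cluster first.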

What do you think?


was (Author: zhenqiuhuang):
[~suez1224] [~till.rohrmann]

I agree with Shuyi's proposal, since yarn.maximum-failed-containers is more of 
a job-level configuration than a session-cluster-level one. We could start 
with a simple fix for the per-job cluster to restore feature parity with the 
former release. 

1) I will add a boolean parameter to YarnResourceManager to distinguish whether 
it runs for a per-job cluster or a session cluster. I will also pass 
LeaderGatewayRetriever<DispatcherGateway> dispatcherGatewayRetriever as a 
parameter of the YarnResourceManager constructor.

2) If it is a per-job cluster, once the threshold is hit, shut down the 
cluster (shutdownCluster) through the DispatcherGateway. 

What do you think?

> Flink's JobCluster ResourceManager doesn't use yarn.maximum-failed-containers 
> as limit of resource acquirement
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10868
>                 URL: https://issues.apache.org/jira/browse/FLINK-10868
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers as 
> the limit of resource acquirement. In the worst case, when newly started 
> containers consistently fail, YarnResourceManager will go into an infinite 
> resource acquirement process without failing the job. Together with 
> https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all 
> resources of the YARN queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
