[jira] [Commented] (FLINK-10868) Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement

2018-11-23 Thread Zhenqiu Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16697677#comment-16697677
 ] 

Zhenqiu Huang commented on FLINK-10868:
---

[~suez1224] [~till.rohrmann]

Agree with Shuyi's proposal. Since yarn.maximum-failed-containers is more a 
job-level configuration than a session-cluster-level one, we may have a simple 
fix for the per-job cluster first to achieve feature parity with the former 
release.

1) I will add a boolean parameter to YarnResourceManager to distinguish whether 
it runs for a per-job cluster or a session cluster, and also pass a 
LeaderGatewayRetriever<DispatcherGateway> dispatcherGatewayRetriever as a 
constructor parameter of YarnResourceManager.

2) If it is a per-job cluster, once the threshold is hit, shut down the cluster 
via the DispatcherGateway (see the sketch below).
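
For concreteness, a rough sketch of how 1) and 2) could look inside 
YarnResourceManager. All field and method names below (isPerJobCluster, 
maxFailedContainers, failedContainers, onContainerFailed, requestYarnContainer) 
are illustrative, not the actual code; only the shutDownCluster() call on the 
DispatcherGateway is the mechanism proposed in 2).

{code:java}
// Illustrative sketch only; these fields and methods are hypothetical.
private final boolean isPerJobCluster;   // new constructor flag from 1)
private final LeaderGatewayRetriever<DispatcherGateway> dispatcherGatewayRetriever;
private final int maxFailedContainers;   // parsed from yarn.maximum-failed-containers
private int failedContainers = 0;

private void onContainerFailed(ContainerStatus containerStatus) {
    failedContainers++;
    if (isPerJobCluster && failedContainers > maxFailedContainers) {
        // 2) Threshold hit in per-job mode: stop the whole cluster via the Dispatcher.
        dispatcherGatewayRetriever
            .getFuture()
            .thenCompose(DispatcherGateway::shutDownCluster);
    } else {
        // Session mode, or still under the threshold: keep the existing retry behavior.
        requestYarnContainer(containerStatus);
    }
}
{code}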

What do you think?

> Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as 
> limit of resource acquirement
> 
>
> Key: FLINK-10868
> URL: https://issues.apache.org/jira/browse/FLINK-10868
> Project: Flink
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2, 1.7.0
>Reporter: Zhenqiu Huang
>Assignee: Zhenqiu Huang
>Priority: Major
>
> Currently, YarnResourceManager doesn't use yarn.maximum-failed-containers as 
> a limit on resource acquirement. In the worst case, when newly started 
> containers consistently fail, YarnResourceManager will go into an infinite 
> resource acquirement process without failing the job. Together with 
> https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all 
> resources of the YARN queue.





[jira] [Commented] (FLINK-10868) Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement

2018-11-20 Thread Shuyi Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693833#comment-16693833
 ] 

Shuyi Chen commented on FLINK-10868:


Hi [~till.rohrmann], the following describes the problem that we saw:
1) In YarnResourceManager, after a container is allocated, it will start the 
container in onContainersAllocated().
2) In createTaskExecutorLaunchContext, it will try to call fs.getFileStatus in 
registerLocalResource, which accesses the file status on HDFS.
3) In a rare scenario, some of the above files in HDFS were not accessible due 
to HDFS issues. createTaskExecutorLaunchContext will then throw an exception 
and cause YarnResourceManager to keep reacquiring resources due to container 
start failures, because the files are no longer accessible (a minimal 
reproduction follows below).
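
To make the failure mode concrete, here is a minimal, self-contained 
reproduction of steps 2) and 3) using the plain Hadoop FileSystem API. The 
class name and path are made up for illustration; only the fs.getFileStatus 
call mirrors what registerLocalResource does.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical reproduction of the registerLocalResource failure path.
public class LocalResourceStatusCheck {
    public static void main(String[] args) throws Exception {
        Configuration hadoopConf = new Configuration();
        // e.g. an hdfs:// path to a file shipped at job submission time
        Path remoteFile = new Path(args[0]);
        FileSystem fs = remoteFile.getFileSystem(hadoopConf);

        // This is the same kind of call registerLocalResource performs for every
        // shipped file. If the file was deleted or became unreadable after
        // submission, it throws (e.g. FileNotFoundException), so every new
        // container launch fails the same way and the RM keeps re-requesting.
        FileStatus status = fs.getFileStatus(remoteFile);
        System.out.println("len=" + status.getLen()
                + ", modified=" + status.getModificationTime());
    }
}
{code}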

In the above case, the job will be stuck in a loop of acquiring new resources. 
Since the files are already broken/missing, there is no way for Flink to 
recover by itself; we need to fail the job and fall back to the client side to 
fix the files and resubmit entirely.

Together with [FLINK-10848|https://issues.apache.org/jira/browse/FLINK-10848], 
it even exacerbates the problem and causes the entire YARN queue's resources to 
get depleted. I've submitted a PR to fix 
[FLINK-10848|https://issues.apache.org/jira/browse/FLINK-10848]; could you 
please also help take a look?

I am wondering if we could separate this JIRA into two parts, one for the 
per-job cluster and one for the session cluster. For this JIRA, we could:
1) apply yarn.maximum-failed-containers for per-job cluster mode (see the 
sketch after this list);
2) log a warning saying that yarn.maximum-failed-containers is not supported 
for session clusters;
3) update the documentation on yarn.maximum-failed-containers on the website.
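
A small sketch of how 1) and 2) could fit together. The helper class and its 
method are hypothetical; only the yarn.maximum-failed-containers key comes from 
the existing configuration.

{code:java}
import org.apache.flink.configuration.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical helper; -1 means "no limit", mirroring the current behavior.
final class MaxFailedContainersOption {
    private static final Logger LOG =
            LoggerFactory.getLogger(MaxFailedContainersOption.class);

    static int resolve(Configuration flinkConfig, boolean isPerJobCluster) {
        final int maxFailedContainers =
                flinkConfig.getInteger("yarn.maximum-failed-containers", -1);
        if (!isPerJobCluster && maxFailedContainers >= 0) {
            // Step 2): the option is set but cannot be honored in session mode.
            LOG.warn("yarn.maximum-failed-containers is not supported for "
                    + "session clusters and will be ignored.");
            return -1;
        }
        // Step 1): per-job clusters enforce the configured limit.
        return maxFailedContainers;
    }
}
{code}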

What do you think?



[jira] [Commented] (FLINK-10868) Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement

2018-11-20 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692892#comment-16692892
 ] 

Till Rohrmann commented on FLINK-10868:
---

Hi [~hpeter], I think this is not trivial to achieve because, to detect this 
situation properly, the RM needs a communication channel to the {{Dispatcher}} 
to tell it about the depleted resource requests. Moreover, we would need to 
fail all currently running jobs and wait for them to reach a globally terminal 
state before we can shut down the cluster (roughly as sketched below).
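
Schematically, the sequence being described could look like the following. The 
gateway methods used here (cancelJob, requestJobResult, shutDownCluster) mirror 
the REST-facing gateway calls; whether and how the RM would drive them is 
exactly the open design question, so treat this as a placeholder.

{code:java}
// Placeholder sketch: drive every running job to a terminal state, wait for
// all of them to become globally terminal, only then take the cluster down.
CompletableFuture<Void> allJobsTerminal = CompletableFuture.allOf(
        runningJobIds.stream()
                .map(jobId -> dispatcherGateway
                        // push the job toward a terminal state
                        .cancelJob(jobId, rpcTimeout)
                        // completes once the job is globally terminal
                        .thenCompose(ack ->
                                dispatcherGateway.requestJobResult(jobId, rpcTimeout)))
                .toArray(CompletableFuture[]::new));

allJobsTerminal.thenRun(() -> dispatcherGateway.shutDownCluster());
{code}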

At the moment, Flink assumes that the RM can, at some point, acquire the 
requested resources and that it should retry in case of a TM failure. In which 
scenario would you like to stop retrying if there is a chance to regain the 
resources and finish your job?



[jira] [Commented] (FLINK-10868) Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement

2018-11-19 Thread Zhenqiu Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692708#comment-16692708
 ] 

Zhenqiu Huang commented on FLINK-10868:
---

[~till.rohrmann]

I am working on a fix in FlinkYarnResourceManager. In per-job cluster mode, 
since the MiniDispatcher will kill itself once the only job stops, it should be 
easy to stop the cluster by killing the only JobMaster registered in the RM via 
its JobMasterGateway. But in session mode, I can only stop each registered 
JobMaster when the number of failed containers grows larger than the threshold 
set in the configuration. Do you have any suggestions for stopping a session 
cluster gracefully?
