[jira] [Commented] (FLINK-10868) Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement
[ https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697677#comment-16697677 ]

Zhenqiu Huang commented on FLINK-10868:
---------------------------------------

[~suez1224] [~till.rohrmann] I agree with Shuyi's proposal, since yarn.maximum-failed-containers is a job-level configuration rather than a session-cluster-level one. We can start with a simple fix for the per-job cluster to achieve feature parity with the previous release:

1) Add a boolean parameter to YarnResourceManager to distinguish whether it runs for a per-job cluster or a session cluster, and pass a LeaderGatewayRetriever (dispatcherGatewayRetriever) as a parameter of the YarnResourceManager constructor.

2) In per-job cluster mode, once the threshold is hit, shut down the cluster via the DispatcherGateway.

What do you think?

> Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as
> limit of resource acquirement
>
>                 Key: FLINK-10868
>                 URL: https://issues.apache.org/jira/browse/FLINK-10868
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers as
> a limit on resource acquisition. In the worst case, when newly started containers
> consistently fail, YarnResourceManager goes into an infinite resource
> acquisition loop without failing the job. Together with
> https://issues.apache.org/jira/browse/FLINK-10848, it can quickly occupy all
> resources of the YARN queue.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
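The threshold check proposed above can be sketched as follows. This is a minimal, self-contained illustration, not Flink code: the class name FailedContainerTracker, the onContainerFailed() callback, and the clusterShutDown flag (a stand-in for the actual DispatcherGateway shutdown call) are all hypothetical.

```java
// Hedged sketch of the proposal: count failed containers and, in per-job
// mode only, shut the cluster down once the configured limit is reached.
// All names here are illustrative; they are not actual Flink APIs.
public class FailedContainerTracker {
    private final int maxFailedContainers;   // value of yarn.maximum-failed-containers
    private final boolean perJobCluster;     // the proposed boolean constructor flag
    private int failedContainers = 0;
    private boolean clusterShutDown = false;

    public FailedContainerTracker(int maxFailedContainers, boolean perJobCluster) {
        this.maxFailedContainers = maxFailedContainers;
        this.perJobCluster = perJobCluster;
    }

    /** Invoked whenever a container completes abnormally. */
    public void onContainerFailed() {
        failedContainers++;
        // Only a per-job cluster shuts itself down; a session cluster keeps retrying.
        if (perJobCluster && failedContainers >= maxFailedContainers) {
            clusterShutDown = true; // stand-in for dispatcherGateway.shutDownCluster()
        }
    }

    public boolean isClusterShutDown() {
        return clusterShutDown;
    }
}
```

In session mode the counter still increments, but no shutdown is triggered, which matches the idea of only logging a warning for session clusters.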
[ https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16693833#comment-16693833 ]

Shuyi Chen commented on FLINK-10868:
------------------------------------

Hi [~till.rohrmann], the following describes the problem that we saw:

1) In YarnResourceManager, after a container is allocated, it is started in onContainerAllocated().

2) In createTaskExecutorLaunchContext, registerLocalResource calls fs.getFileStatus, which accesses the file status on HDFS.

3) In the rare scenario where some of the above files on HDFS are not accessible due to HDFS issues, createTaskExecutorLaunchContext throws an exception, and YarnResourceManager keeps reacquiring resources because every container start fails while the files remain inaccessible.

In this case, the job is stuck in a loop of acquiring new resources. Since the files are already broken or missing, there is no way for Flink to recover by itself; we need to fail the job and fall back to the client side to fix the files and resubmit entirely. Together with [FLINK-10848|https://issues.apache.org/jira/browse/FLINK-10848], this exacerbates the problem and can deplete the entire YARN queue's resources. I've submitted a PR to fix [FLINK-10848|https://issues.apache.org/jira/browse/FLINK-10848]; could you please also help take a look?

I am wondering if we could separate this JIRA into two parts, one for the per-job cluster and one for the session cluster. For this JIRA, we could:

1) apply yarn.maximum-failed-containers in per-job cluster mode,

2) log a warning saying that yarn.maximum-failed-containers is not supported for session clusters, and

3) update the documentation on yarn.maximum-failed-containers on the website.

What do you think?
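The failure loop in steps 1–3 can be sketched as below. This is a self-contained model, not Flink code: launchContainer() stands in for createTaskExecutorLaunchContext (whose fs.getFileStatus call fails permanently once the HDFS files are gone), and acquireUntilLimit() models the resource-acquisition loop with the bound that yarn.maximum-failed-containers is supposed to impose.

```java
// Hedged sketch of the unbounded retry loop described in the comment above.
// All names are hypothetical stand-ins for the Flink internals mentioned.
import java.io.IOException;

public class ContainerRetrySketch {

    // Always fails, modelling registerLocalResource's fs.getFileStatus call
    // on a file that no longer exists on HDFS.
    static void launchContainer() throws IOException {
        throw new IOException("File does not exist on HDFS");
    }

    // Re-requests a container after each failed start and returns the number
    // of failed starts. Without the maxFailed bound, this loop never ends:
    // that is the infinite acquisition loop in the bug report.
    static int acquireUntilLimit(int maxFailed) {
        int failures = 0;
        while (failures < maxFailed) {
            try {
                launchContainer();
            } catch (IOException e) {
                failures++; // each failed start triggers another resource request
            }
        }
        return failures;
    }
}
```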
[ https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692892#comment-16692892 ]

Till Rohrmann commented on FLINK-10868:
---------------------------------------

Hi [~hpeter], I think this is not trivial to achieve, because to detect this situation properly the RM needs a communication channel to the {{Dispatcher}} to inform it about the depleted resource requests. Moreover, we would need to fail all currently running jobs and wait for them to reach a globally terminal state before we can shut down the cluster. At the moment, Flink assumes that the RM can acquire the requested resources at some point and that it should retry in case of a TM failure. In which scenario would you like to stop retrying, given that there is a chance to regain the resources and finish your job?
[ https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692708#comment-16692708 ]

Zhenqiu Huang commented on FLINK-10868:
---------------------------------------

[~till.rohrmann] I am working on a fix in YarnResourceManager. In per-job cluster mode, since the MiniDispatcher kills itself once its only job stops, it should be easy to stop the cluster by failing the only JobMaster registered in the RM via its JobMasterGateway. But in session mode, I can only stop each registered JobMaster when the number of failed containers exceeds the threshold set in the configuration. Do you have any suggestions on how to stop a session cluster gracefully?
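The two shutdown paths described above can be sketched as follows. This is an illustrative model only: the JobMasterHandle interface and failJob() method are hypothetical and not Flink's JobMasterGateway API, though the asymmetry they show is the one in the comment, in per-job mode failing the single job is enough to tear the cluster down, while in session mode failing every job still leaves the cluster process running.

```java
// Hedged sketch of per-job vs. session shutdown; all names are hypothetical.
import java.util.List;

public class ShutdownSketch {

    /** Stand-in for a JobMasterGateway registered with the RM. */
    interface JobMasterHandle {
        void failJob(String reason);
    }

    // Per-job mode: exactly one registered JobMaster. Failing its job lets
    // the MiniDispatcher terminate itself, which tears down the cluster.
    static void shutdownPerJob(JobMasterHandle onlyJobMaster) {
        onlyJobMaster.failJob("yarn.maximum-failed-containers exceeded");
    }

    // Session mode: fail every registered job, but the cluster process itself
    // stays up unless the Dispatcher is also told to shut down - which is
    // exactly the open question about graceful session-cluster shutdown.
    static void shutdownSessionJobs(List<JobMasterHandle> jobMasters) {
        for (JobMasterHandle jm : jobMasters) {
            jm.failJob("yarn.maximum-failed-containers exceeded");
        }
    }
}
```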