[jira] [Updated] (YARN-5703) ReservationAgents are not correctly configured
[ https://issues.apache.org/jira/browse/YARN-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikandan R updated YARN-5703:
-------------------------------
    Attachment: YARN-5703.003.patch

> ReservationAgents are not correctly configured
> ----------------------------------------------
>
>                 Key: YARN-5703
>                 URL: https://issues.apache.org/jira/browse/YARN-5703
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, resourcemanager
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Sean Po
>            Assignee: Manikandan R
>         Attachments: YARN-5703.001.patch, YARN-5703.002.patch, YARN-5703.003.patch
>
> In AbstractReservationSystem, the method that instantiates a ReservationAgent
> does not properly initialize it with the appropriate configuration because it
> expects the ReservationAgent to implement Configurable.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-5703) ReservationAgents are not correctly configured
[ https://issues.apache.org/jira/browse/YARN-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853216#comment-15853216 ]

Manikandan R commented on YARN-5703:
------------------------------------

Thanks [~Naganarasimha] for the review. I've attached a newer patch based on your comments and explained each change below.

* Ok, I used {code}agentclass = Configuration.getClass(String, Class, Class){code} to create the objects.
* I've defined a few constants in {code}ReservationSchedulerConfiguration{code}. To access those values through their corresponding getter methods in {code}init(ReservationSchedulerConfiguration conf){code} of the AlignedPlannerWithGreedy & GreedyReservationAgent classes, I've defined the method {code}void init(ReservationSchedulerConfiguration conf);{code} in ReservationAgent.java. If I don't specify the right subclass, those constants and their getters won't be available. Do you see any issues here?
* Yes, Configurable doesn't suit here, but {code}init(ReservationSchedulerConfiguration conf){code} has already been implemented.
* Yes, {code}yarnConf{code} doesn't add value here. The reason for including {code}init(ReservationSchedulerConfiguration conf){code} in IterativePlanner.java is that it extends {code}PlanningAlgorithm{code}, which in turn implements {code}ReservationAgent{code}. Since the configuration is available in the AlignedPlannerWithGreedy & GreedyReservationAgent classes, I thought of passing it along to IterativePlanner.java & TryManyReservationAgents.java as both classes implement the {code}ReservationAgent{code} interface, but there is no use for it currently.
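As a rough illustration of the "create by class name, then init(conf)" pattern discussed in the comment above, here is a minimal, self-contained sketch. All types are simplified, hypothetical stand-ins (not the real Hadoop classes): ReservationSchedulerConfiguration is reduced to a plain holder, and plain reflection stands in for Configuration.getClass.

```java
// Minimal sketch of instantiating an agent by class name and then calling
// init(conf) on it, as discussed above. Simplified stand-ins, not Hadoop code.
public class AgentFactorySketch {
    /** Stand-in for ReservationSchedulerConfiguration. */
    static class SchedulerConf {
        final double smoothnessFactor;
        SchedulerConf(double smoothnessFactor) { this.smoothnessFactor = smoothnessFactor; }
    }

    /** Stand-in for the ReservationAgent interface with the proposed init(). */
    interface ReservationAgent {
        void init(SchedulerConf conf);
        double smoothness();
    }

    /** Stand-in for GreedyReservationAgent: reads conf in init(), not the constructor. */
    public static class GreedyAgent implements ReservationAgent {
        private double smoothness;
        public void init(SchedulerConf conf) { this.smoothness = conf.smoothnessFactor; }
        public double smoothness() { return smoothness; }
    }

    /** Roughly what a getAgent()-style factory would do: instantiate by name, then init. */
    static ReservationAgent createAgent(String className, SchedulerConf conf)
            throws ReflectiveOperationException {
        ReservationAgent agent = (ReservationAgent)
            Class.forName(className).getDeclaredConstructor().newInstance();
        agent.init(conf); // configuration applied after construction
        return agent;
    }

    public static void main(String[] args) throws Exception {
        ReservationAgent a = createAgent("AgentFactorySketch$GreedyAgent",
            new SchedulerConf(0.5));
        System.out.println(a.smoothness()); // prints 0.5
    }
}
```

The point of the sketch is the two-step lifecycle: construction carries no configuration, and everything configuration-dependent happens in init().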
[jira] [Commented] (YARN-3571) AM does not re-blacklist NMs after ignoring-blacklist event happens?
[ https://issues.apache.org/jira/browse/YARN-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15850161#comment-15850161 ]

Manikandan R commented on YARN-3571:
------------------------------------

I am interested in working on this. Can I work on it?

> AM does not re-blacklist NMs after ignoring-blacklist event happens?
> --------------------------------------------------------------------
>
>                 Key: YARN-3571
>                 URL: https://issues.apache.org/jira/browse/YARN-3571
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, resourcemanager
>    Affects Versions: 2.5.1
>            Reporter: Hao Zhu
>
> A detailed analysis is in item "3. Will AM re-blacklist NMs after
> ignoring-blacklist event happens?" at the link below:
> http://www.openkb.info/2015/05/when-will-application-master-blacklist.html
> The current behavior is: if a Node Manager has ever been blacklisted before,
> it will not be blacklisted again after ignore-blacklist happens; otherwise,
> it will be blacklisted.
> However, I think the right behavior should be: the AM can re-blacklist NMs
> even after ignoring-blacklist has happened once.
> The code logic is in the function containerFailedOnHost(String hostName) of
> RMContainerRequestor.java:
> {code}
> protected void containerFailedOnHost(String hostName) {
>   if (!nodeBlacklistingEnabled) {
>     return;
>   }
>   if (blacklistedNodes.contains(hostName)) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("Host " + hostName + " is already blacklisted.");
>     }
>     return; // already blacklisted
>   }
>   ...
> }
> {code}
> The reason for the above behavior is in item 2 above: when ignoring-blacklist
> happens, it only asks the RM to clear "blacklistAdditions"; it does not clear
> the "blacklistedNodes" variable.
> This behavior may cause the whole job/application to fail if the previously
> blacklisted NM was released after the ignoring-blacklist event happened.
> Imagine a serial murderer released from prison just because the prison is
> 33% full, who, horribly, will never be put in prison again. Only new
> murderers will be put in prison.
> Example to prove it:
> Test 1:
> One node (h4) has an issue; the other 3 nodes are healthy.
> The job failed with the AM logs below:
> {code}
> [root@h1 container_1430425729977_0006_01_01]# egrep -i 'failures on node|blacklist|FATAL' syslog
> 2015-05-02 18:38:41,246 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: nodeBlacklistingEnabled:true
> 2015-05-02 18:38:41,246 INFO [main] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: blacklistDisablePercent is 1
> 2015-05-02 18:39:07,249 FATAL [IPC Server handler 3 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_02_0 - exited : java.io.IOException: Spill failed
> 2015-05-02 18:39:07,297 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 1 failures on node h4.poc.com
> 2015-05-02 18:39:07,950 FATAL [IPC Server handler 16 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_08_0 - exited : java.io.IOException: Spill failed
> 2015-05-02 18:39:07,954 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 2 failures on node h4.poc.com
> 2015-05-02 18:39:08,148 FATAL [IPC Server handler 17 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_07_0 - exited : java.io.IOException: Spill failed
> 2015-05-02 18:39:08,152 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: 3 failures on node h4.poc.com
> 2015-05-02 18:39:08,152 INFO [Thread-49] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Blacklisted host h4.poc.com
> 2015-05-02 18:39:08,561 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Update the blacklist for application_1430425729977_0006: blacklistAdditions=1 blacklistRemovals=0
> 2015-05-02 18:39:08,561 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Ignore blacklisting set to true. Known: 4, Blacklisted: 1, 25%
> 2015-05-02 18:39:09,563 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: Update the blacklist for application_1430425729977_0006: blacklistAdditions=0 blacklistRemovals=1
> 2015-05-02 18:39:32,912 FATAL [IPC Server handler 19 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_02_1 - exited : java.io.IOException: Spill failed
> 2015-05-02 18:39:35,076 FATAL [IPC Server handler 1 on 41696] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Task: attempt_1430425729977_0006_m_09_0 - exited : java.io.IOException: Spill failed
> 2015-05-02 18:39:35,133 FATAL [IPC Server
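The behavior the reporter argues for can be sketched as follows (hypothetical, drastically simplified from RMContainerRequestor; the real class tracks much more state): when ignore-blacklisting kicks in, also clear the local blacklistedNodes set, so a node can be blacklisted again later.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of the blacklisting state described in the issue.
// Not the real RMContainerRequestor; names are assumptions.
public class BlacklistSketch {
    private final Set<String> blacklistedNodes = new HashSet<>();

    /** Called when a container fails on a host. Returns true if newly blacklisted. */
    public boolean containerFailedOnHost(String hostName) {
        if (blacklistedNodes.contains(hostName)) {
            return false; // already blacklisted, as in the quoted code
        }
        blacklistedNodes.add(hostName);
        return true;
    }

    /** The proposed behavior: when the blacklist is ignored, forget old entries too. */
    public void ignoreBlacklisting() {
        // Without this clear(), a host like h4 could never be re-blacklisted
        // after the ignore-blacklist event, which is the bug being reported.
        blacklistedNodes.clear();
    }

    public boolean isBlacklisted(String hostName) {
        return blacklistedNodes.contains(hostName);
    }
}
```

With the clear() in place, a node that keeps failing after the ignore-blacklist event is counted and blacklisted afresh instead of being silently skipped.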
[jira] [Commented] (YARN-5179) Issue of CPU usage of containers
[ https://issues.apache.org/jira/browse/YARN-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15850154#comment-15850154 ]

Manikandan R commented on YARN-5179:
------------------------------------

I am interested in working on this. Can I take it forward?

> Issue of CPU usage of containers
> --------------------------------
>
>                 Key: YARN-5179
>                 URL: https://issues.apache.org/jira/browse/YARN-5179
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.0
>        Environment: Both on Windows and Linux
>            Reporter: Zhongkai Mi
>
> // Multiply by 1000 to avoid losing data when converting to int
> int milliVcoresUsed = (int) (cpuUsageTotalCoresPercentage * 1000
>     * maxVCoresAllottedForContainers / nodeCpuPercentageForYARN);
> This formula will not produce the right vcore-based CPU usage if
> vcores != physical cores.
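To make the quoted formula concrete, here is a worked computation with made-up numbers (all values are assumptions for illustration, not from the issue): a node whose 4 physical cores are exposed as 8 vcores, with 100% of the CPU allotted to YARN, and containers currently consuming 50% of total physical CPU.

```java
// Worked example of the formula quoted in YARN-5179, with made-up inputs,
// to illustrate the vcores != physical-cores concern.
public class MilliVcoresExample {
    static int milliVcoresUsed(float cpuUsageTotalCoresPercentage,
                               int maxVCoresAllottedForContainers,
                               int nodeCpuPercentageForYARN) {
        // Quoted formula: multiply by 1000 to avoid losing data in the int cast
        return (int) (cpuUsageTotalCoresPercentage * 1000
            * maxVCoresAllottedForContainers / nodeCpuPercentageForYARN);
    }

    public static void main(String[] args) {
        // 50% of total physical CPU used, 8 vcores allotted, 100% CPU for YARN
        System.out.println(milliVcoresUsed(50f, 8, 100)); // prints 4000
    }
}
```

The formula reports 4000 milli-vcores (4 vcores) of usage, i.e. it scales usage by the configured vcore count rather than by the physical cores actually doing the work; that scaling is the mismatch the reporter is pointing at when vcores != physical cores.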
[jira] [Commented] (YARN-5703) ReservationAgents are not correctly configured
[ https://issues.apache.org/jira/browse/YARN-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15848674#comment-15848674 ]

Manikandan R commented on YARN-5703:
------------------------------------

[~Naganarasimha], [~seanpo03] Fixed the Greedy RA JUnit test case and attached a patch for it. Please review.
[jira] [Updated] (YARN-5703) ReservationAgents are not correctly configured
[ https://issues.apache.org/jira/browse/YARN-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikandan R updated YARN-5703:
-------------------------------
    Attachment: YARN-5703.002.patch
[jira] [Commented] (YARN-5703) ReservationAgents are not correctly configured
[ https://issues.apache.org/jira/browse/YARN-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15844017#comment-15844017 ]

Manikandan R commented on YARN-5703:
------------------------------------

Thanks [~seanpo03], [~Naganarasimha]. I've explained my approach below based on my analysis. Please provide your feedback.

By implementing the Configurable interface in the RAs (the AlignedPlannerWithGreedy & GreedyReservationAgent classes), the conf object would get passed from the getAgent() method of AbstractReservationSystem.java while creating objects for the above RAs. But to make use of conf to fetch properties like the smoothness factor etc., we would need a new method (for example, initialize()?) to set the RAs' variables (for example, planner), because the conf object cannot be used inside the constructor: it only gets set via the setConf() method after object creation. Since initialization would then happen in the new method, we would need to make the "planner" variable non-final.
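The constructor-vs-setConf ordering issue described above can be sketched like this (hypothetical, simplified stand-ins mimicking how a Configurable-style factory applies configuration only after construction; not the real Hadoop types):

```java
// Sketch of why conf can't be read in the constructor under the
// Configurable approach: the factory sets conf only after the object
// is built. Simplified stand-ins, not the real Hadoop types.
public class ConfigurableOrderingSketch {
    interface Configurable { void setConf(String conf); }

    static class Agent implements Configurable {
        String conf;                  // still null during construction
        final String seenInCtor;
        Agent() { seenInCtor = String.valueOf(conf); } // conf not set yet
        public void setConf(String conf) { this.conf = conf; }
    }

    public static void main(String[] args) {
        Agent a = new Agent();        // 1. construct
        a.setConf("smoothness=0.5");  // 2. only now does conf arrive
        System.out.println(a.seenInCtor + " -> " + a.conf);
        // prints: null -> smoothness=0.5
    }
}
```

This is why any conf-dependent field (like planner) would have to be assigned in a post-construction method and hence could no longer be final.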
[jira] [Commented] (YARN-5703) ReservationAgents are not correctly configured
[ https://issues.apache.org/jira/browse/YARN-5703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15840169#comment-15840169 ]

Manikandan R commented on YARN-5703:
------------------------------------

I am interested in working on this. Shall I take this forward?

Thanks,
Mani
[jira] [Created] (YARN-5900) Configuring minimum-allocation-mb at queue level
Manikandan R created YARN-5900:
----------------------------------

             Summary: Configuring minimum-allocation-mb at queue level
                 Key: YARN-5900
                 URL: https://issues.apache.org/jira/browse/YARN-5900
             Project: Hadoop YARN
          Issue Type: Improvement
            Reporter: Manikandan R

The motivation for proposing minimum-allocation-mb at the queue level, in the form of yarn.scheduler.capacity.<queue-path>.minimum-allocation-mb, is for when the queue structure has been designed around resource usage (only memory for now). For example, there could be three segments (small, medium & large jobs) and a queue could be created for each segment accordingly. With this, it would be good to configure the minimum container size of each queue separately: for small it is 1 GB, for medium it is 3 GB, and for large it can be 6 GB. Would this simplify the container release process and its overall management, eventually reducing the number of containers running at any moment?
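The proposal could look like the following capacity-scheduler.xml fragment. This is purely a sketch of the proposed (non-existent) property; the queue names root.small / root.medium / root.large and the values are assumptions taken from the example above.

```xml
<!-- Hypothetical capacity-scheduler.xml fragment sketching the YARN-5900
     proposal. The per-queue minimum-allocation-mb property does not exist;
     queue names and values are illustrative only. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.small.minimum-allocation-mb</name>
    <value>1024</value> <!-- small jobs: 1 GB minimum container -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.medium.minimum-allocation-mb</name>
    <value>3072</value> <!-- medium jobs: 3 GB minimum container -->
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.large.minimum-allocation-mb</name>
    <value>6144</value> <!-- large jobs: 6 GB minimum container -->
  </property>
</configuration>
```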
[jira] [Commented] (YARN-5370) Setting yarn.nodemanager.delete.debug-delay-sec to high number crashes NM because of OOM
[ https://issues.apache.org/jira/browse/YARN-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375339#comment-15375339 ]

Manikandan R commented on YARN-5370:
------------------------------------

To solve this issue, we tried setting yarn.nodemanager.delete.debug-delay-sec to a very low value (zero seconds), assuming that it might clear the existing scheduled deletion tasks. It didn't happen: the new value is not applied to tasks that have already been scheduled. Then we came to know that the canRecover() method is called at service start; it pulls the info from the NM recovery directory (on the local filesystem) and builds this entire state in memory, which in turn was causing the problems in starting the services and consuming so much memory. Then we tried moving the contents of the NM recovery directory to some other place. From that point onwards, the NM was able to start smoothly and worked as expected.

I think showing a warning about such a high value (for example, 100+ days) somewhere (for example, in the logs), indicating that it can cause a potential crash, could save a significant amount of time when troubleshooting this issue.
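The warning suggested above could be sketched like this (hypothetical helper, not actual NodeManager code; the class name, method name, and the 100-day threshold are all assumptions):

```java
// Hypothetical sketch of the warning suggested above: validate the
// configured delete.debug-delay-sec at service start and flag values high
// enough to let recovered deletion tasks pile up in memory.
public class DeletionDelaySanityCheck {
    // ~100 days in seconds, the kind of value reported in this issue
    static final long WARN_THRESHOLD_SEC = 100L * 24 * 60 * 60;

    /** Returns a warning message to log, or null if the value looks sane. */
    static String checkDebugDelay(long debugDelaySec) {
        if (debugDelaySec > WARN_THRESHOLD_SEC) {
            return "yarn.nodemanager.delete.debug-delay-sec=" + debugDelaySec
                + " is very high; recovered deletion tasks may accumulate"
                + " in memory and can OOM the NodeManager on restart";
        }
        return null;
    }

    public static void main(String[] args) {
        // ~115 days triggers the warning; one hour does not
        System.out.println(checkDebugDelay(10_000_000L) != null); // prints true
        System.out.println(checkDebugDelay(3600L) != null);       // prints false
    }
}
```

In the real NM this check would naturally live where the configuration is read at service init, so the warning lands in the NM log before any recovery work starts.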
[jira] [Created] (YARN-5370) Setting yarn.nodemanager.delete.debug-delay-sec to high number crashes NM because of OOM
Manikandan R created YARN-5370:
----------------------------------

             Summary: Setting yarn.nodemanager.delete.debug-delay-sec to high number crashes NM because of OOM
                 Key: YARN-5370
                 URL: https://issues.apache.org/jira/browse/YARN-5370
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Manikandan R

I set yarn.nodemanager.delete.debug-delay-sec to 100+ days in my dev cluster for some reasons. This was done 3-4 weeks ago. Since then, the NM at times crashes because of OOM, so I kept increasing its heap from 512 MB to 6 GB over the past few weeks, gradually, as and when the crash occurred, as a temporary fix. Sometimes it won't start smoothly, and only after multiple tries does it start functioning.

While analyzing the heap dump of the corresponding JVM, I came to know that DeletionService.java is occupying almost 99% of the total allocated memory (-Xmx), something like this:

org.apache.hadoop.yarn.server.nodemanager.DeletionService$DelServiceSchedThreadPoolExecutor @ 0x6c1d09068 | 80 | 3,544,094,696 | 99.13%

Basically, there is a huge number of the above-mentioned tasks scheduled for deletion. Usually I see NM memory requirements of 2-4 GB for large clusters; in my case the cluster is very small and OOM still occurs.

Is this expected behaviour? Or is there any limit we can expose on yarn.nodemanager.delete.debug-delay-sec to avoid these kinds of issues?