[jira] [Commented] (YARN-6740) Federation Router (hiding multiple RMs for ApplicationClientProtocol) phase 2
[ https://issues.apache.org/jira/browse/YARN-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875351#comment-16875351 ] hunshenshi commented on YARN-6740: -- Thanks [~abmodi] [~giovanni.fumarola] > Federation Router (hiding multiple RMs for ApplicationClientProtocol) phase 2 > - > > Key: YARN-6740 > URL: https://issues.apache.org/jira/browse/YARN-6740 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Giovanni Matteo Fumarola >Assignee: Abhishek Modi >Priority: Major > > This JIRA tracks the implementation of the layer for routing > ApplicationClientProtocol requests to the appropriate RM(s) in a federated > YARN cluster. > Under YARN-3659 we only implemented getNewApplication, submitApplication, > forceKillApplication and getApplicationReport to execute applications E2E. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
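For readers new to the Router work, the routing pattern being tracked here looks roughly like the sketch below: each ApplicationClientProtocol method resolves the application's home sub-cluster and delegates to that sub-cluster's RM proxy. This is only an illustration of the idea, not the actual YARN-6740 code; the class name and the two abstract helpers are stand-ins for the real state-store facade and proxy cache.

{code:java}
import java.io.IOException;

import org.apache.hadoop.yarn.api.ApplicationClientProtocol;
import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationReportRequest;
import org.apache.hadoop.yarn.api.protocolrecords.GetApplicationReportResponse;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.server.federation.store.records.SubClusterId;

public abstract class ClientRequestRoutingSketch {
  // Hypothetical helpers standing in for the real federation state-store
  // lookup and the per-sub-cluster ApplicationClientProtocol proxy cache.
  protected abstract SubClusterId getHomeSubCluster(ApplicationId appId);

  protected abstract ApplicationClientProtocol getClientRMProxy(
      SubClusterId subClusterId);

  public GetApplicationReportResponse getApplicationReport(
      GetApplicationReportRequest request) throws YarnException, IOException {
    // Route the call to the RM of the sub-cluster that owns the application.
    SubClusterId home = getHomeSubCluster(request.getApplicationId());
    return getClientRMProxy(home).getApplicationReport(request);
  }
}
{code}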
[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875350#comment-16875350 ] hunshenshi commented on YARN-9655: -- Sure, I added a UT in TestFederationInterceptor#testAllocateResponse. Thanks for the review [~cheersyang] > AllocateResponse in FederationInterceptor lost applicationPriority > --- > > Key: YARN-9655 > URL: https://issues.apache.org/jira/browse/YARN-9655 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > > In YARN Federation mode using FederationInterceptor, when submitting > application, am will report an error. > {code:java} > 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > java.lang.NullPointerException at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280) > at java.lang.Thread.run(Thread.java:748) > {code} > The reason is that applicationPriority is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
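For context, the property such a UT has to pin down can be illustrated in miniature as below. This sketch does not spin up the interceptor the way TestFederationInterceptor does; it only shows the field that was being dropped when the merged AllocateResponse was built for the AM. The test class name is made up.

{code:java}
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.util.Records;
import org.junit.Test;

public class ApplicationPrioritySketchTest {
  @Test
  public void mergedResponseKeepsApplicationPriority() {
    AllocateResponse homeResponse = Records.newRecord(AllocateResponse.class);
    homeResponse.setApplicationPriority(Priority.newInstance(5));

    // The NPE in handleJobPriorityChange occurs when the response handed to
    // the AM never had this field copied from the home sub-cluster response.
    AllocateResponse merged = Records.newRecord(AllocateResponse.class);
    merged.setApplicationPriority(homeResponse.getApplicationPriority());

    assertEquals(5, merged.getApplicationPriority().getPriority());
  }
}
{code}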
[jira] [Commented] (YARN-9562) Add Java changes for the new RuncContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875283#comment-16875283 ] Eric Badger commented on YARN-9562: --- Attaching patch 001 as an initial patch to give everyone a sense of how the patch will look for the most part. Currently, the {{RuncContainerRuntime}} is using Docker configs, but we have decided to split the config into docker and runc separately. So this will need to change in future versions of the patch. Additionally, there are currently no unit tests. An entire suite of new tests will need to be written before this can be committed. Because of this, I'm not going to submit the patch until a later revision. Other than that, I'm happy to hear feedback from others while I fix up the patch, add unit tests, etc. To try this out (along with the C code changes from YARN-9561), you will need to create squashfs layers from all of the layers of a docker image and upload them to the layers directory specified by the configs. The image config will go in the config directory, and the manifest in the manifests directory. There is also some magic that needs to be done in relation to whiteout and opaque files in the docker image, but you can probably get your image to run without dealing with those. I have a tool that does the whole conversion, but that isn't yet ready to put up for review because there are some bits of code that rely on internal changes that haven't been made to the Apache codebase. If you'd like, I could try and put that up before focusing on the unit tests for this JIRA as well as YARN-9561. > Add Java changes for the new RuncContainerRuntime > - > > Key: YARN-9562 > URL: https://issues.apache.org/jira/browse/YARN-9562 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9562.001.patch > > > This JIRA will be used to add the Java changes for the new > RuncContainerRuntime. This will work off of YARN-9560 to use much of the > existing DockerLinuxContainerRuntime code once it is moved up into an > abstract class that can be extended. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
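For anyone trying the patch out, the manual layer-conversion step described above could look roughly like this, assuming mksquashfs is installed. The class, paths, and file naming are invented for illustration, and the whiteout/opaque handling mentioned above is deliberately omitted.

{code:java}
import java.io.File;
import java.io.IOException;

public class LayerToSquashFsSketch {
  /**
   * Squash one extracted Docker layer into the layers directory that the
   * runtime's configs point at. Layer naming by SHA is an assumption.
   */
  public static void convert(File extractedLayerDir, File layersDir,
      String layerSha) throws IOException, InterruptedException {
    File out = new File(layersDir, layerSha + ".sqsh");
    Process p = new ProcessBuilder("mksquashfs",
        extractedLayerDir.getAbsolutePath(), out.getAbsolutePath(),
        "-noappend").inheritIO().start();
    if (p.waitFor() != 0) {
      throw new IOException("mksquashfs failed for layer " + layerSha);
    }
  }
}
{code}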
[jira] [Updated] (YARN-9562) Add Java changes for the new RuncContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9562: -- Attachment: YARN-9562.001.patch > Add Java changes for the new RuncContainerRuntime > - > > Key: YARN-9562 > URL: https://issues.apache.org/jira/browse/YARN-9562 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Attachments: YARN-9562.001.patch > > > This JIRA will be used to add the Java changes for the new > RuncContainerRuntime. This will work off of YARN-9560 to use much of the > existing DockerLinuxContainerRuntime code once it is moved up into an > abstract class that can be extended. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875252#comment-16875252 ] Eric Badger commented on YARN-9560: --- Thanks [~eyang], [~Jim_Brennan], [~ccondit], [~shaneku...@gmail.com] for the patience and help with this patch! I'll put up an initial patch for YARN-9562 soon > Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime > --- > > Key: YARN-9560 > URL: https://issues.apache.org/jira/browse/YARN-9560 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Fix For: 3.3.0 > > Attachments: YARN-9560.001.patch, YARN-9560.002.patch, > YARN-9560.003.patch, YARN-9560.004.patch, YARN-9560.005.patch, > YARN-9560.006.patch, YARN-9560.007.patch, YARN-9560.008.patch, > YARN-9560.009.patch, YARN-9560.010.patch, YARN-9560.011.patch, > YARN-9560.012.patch, YARN-9560.013.patch > > > Since the new RuncContainerRuntime will be using a lot of the same code as > DockerLinuxContainerRuntime, it would be good to move a bunch of the > DockerLinuxContainerRuntime code up a level to an abstract class that both of > the runtimes can extend. > The new structure will look like: > {noformat} > OCIContainerRuntime (abstract class) > - DockerLinuxContainerRuntime > - RuncContainerRuntime > {noformat} > This JIRA should only change the structure of the code, not the actual > semantics -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
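A skeletal view of the restructure, for readers skimming the thread: shared OCI logic moves up into the abstract parent, and each concrete runtime keeps only its own specifics. The method shown is representative only, not the real API; RuncContainerRuntime arrives with YARN-9562.

{code:java}
public abstract class OCIContainerRuntime {
  // Shared validation/launch plumbing lives here; this hook is illustrative.
  abstract String getRuntimeType();
}

class DockerLinuxContainerRuntime extends OCIContainerRuntime {
  String getRuntimeType() {
    return "docker";
  }
}

class RuncContainerRuntime extends OCIContainerRuntime { // added by YARN-9562
  String getRuntimeType() {
    return "runc";
  }
}
{code}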
[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875235#comment-16875235 ] Hudson commented on YARN-9560: -- FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16838 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16838/]) YARN-9560. Restructure DockerLinuxContainerRuntime to extend (eyang: rev 29465bf169a7e348a4f32265083450faf66d5631) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/resources/gpu/GpuResourceHandlerImpl.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerCleanup.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/OCIContainerRuntime.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/TestDockerContainerRuntime.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/resourceplugin/deviceframework/DeviceResourceHandlerImpl.java > Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime > --- > > Key: YARN-9560 > URL: https://issues.apache.org/jira/browse/YARN-9560 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Attachments: YARN-9560.001.patch, YARN-9560.002.patch, > YARN-9560.003.patch, YARN-9560.004.patch, YARN-9560.005.patch, > YARN-9560.006.patch, YARN-9560.007.patch, YARN-9560.008.patch, > YARN-9560.009.patch, YARN-9560.010.patch, YARN-9560.011.patch, > YARN-9560.012.patch, YARN-9560.013.patch > > > Since the new RuncContainerRuntime will be using a lot of the same code as > DockerLinuxContainerRuntime, it would be good to move a bunch of the > DockerLinuxContainerRuntime code up a level to an abstract class that both of > the runtimes can extend. > The new structure will look like: > {noformat} > OCIContainerRuntime (abstract class) > - DockerLinuxContainerRuntime > - RuncContainerRuntime > {noformat} > This JIRA should only change the structure of the code, not the actual > semantics -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875217#comment-16875217 ] Eric Yang commented on YARN-9560: - +1 on patch 013. Will commit shortly. > Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime > --- > > Key: YARN-9560 > URL: https://issues.apache.org/jira/browse/YARN-9560 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Attachments: YARN-9560.001.patch, YARN-9560.002.patch, > YARN-9560.003.patch, YARN-9560.004.patch, YARN-9560.005.patch, > YARN-9560.006.patch, YARN-9560.007.patch, YARN-9560.008.patch, > YARN-9560.009.patch, YARN-9560.010.patch, YARN-9560.011.patch, > YARN-9560.012.patch, YARN-9560.013.patch > > > Since the new RuncContainerRuntime will be using a lot of the same code as > DockerLinuxContainerRuntime, it would be good to move a bunch of the > DockerLinuxContainerRuntime code up a level to an abstract class that both of > the runtimes can extend. > The new structure will look like: > {noformat} > OCIContainerRuntime (abstract class) > - DockerLinuxContainerRuntime > - RuncContainerRuntime > {noformat} > This JIRA should only change the structure of the code, not the actual > semantics -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9581) Fix WebAppUtils#getRMWebAppURLWithScheme ignores rm2
[ https://issues.apache.org/jira/browse/YARN-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875157#comment-16875157 ] Eric Yang edited comment on YARN-9581 at 6/28/19 8:37 PM: -- [~Prabhu Joseph] Thank you for the patch. I just committed addendum patch 001 to trunk and branch-3.2. Fixed version remains the same. was (Author: eyang): [~Prabhu Joseph] Thank you for the patch. I just committed addendum patch 001 to trunk. > Fix WebAppUtils#getRMWebAppURLWithScheme ignores rm2 > > > Key: YARN-9581 > URL: https://issues.apache.org/jira/browse/YARN-9581 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9581-001.patch, YARN-9581-002.patch, > YARN-9581-003.patch, YARN-9581-004.patch, YARN-9581-005.patch, > YARN-9581-006.patch, YARN-9581-007.patch, YARN-9581.addendum-001.patch > > > Yarn Logs fails for a running job in case of RM HA with rm2 active and rm1 is > down. > {code} > hrt_qa@prabhuYarn:~> /usr/hdp/current/hadoop-yarn-client/bin/yarn logs > -applicationId application_1558613472348_0004 -am 1 > 19/05/24 18:04:49 INFO client.AHSProxy: Connecting to Application History > server at prabhuYarn/172.27.23.55:10200 > 19/05/24 18:04:50 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Unable to get AM container informations for the > application:application_1558613472348_0004 > java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > Error while authenticating with endpoint: > https://prabhuYarn:8090/ws/v1/cluster/apps/application_1558613472348_0004/appattempts > Can not get AMContainers logs for the > application:application_1558613472348_0004 with the appOwner:hrt_qa > {code} > LogsCli getRMWebAppURLWithoutScheme only checks the first one from the RM > list yarn.resourcemanager.ha.rm-ids. > {code} > yarnConfig.set(YarnConfiguration.RM_HA_ID, rmIds.get(0)); > {code} > SchedConfCli also fails > {code} > [ambari-qa@pjosephdocker-3 ~]$ yarn schedulerconf -update > root.default:maximum-capacity=90 > Exception in thread "main" com.sun.jersey.api.client.ClientHandlerException: > java.net.ConnectException: Connection refused (Connection refused) > at > com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:155) > at com.sun.jersey.api.client.Client.handle(Client.java:652) > at com.sun.jersey.api.client.WebResource.handle(WebResource.java:682) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
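The shape of the fix, condensed: resolve a webapp address for every configured RM ID rather than only the first, so a client can fail over to rm2 when rm1 is down. The helper below is a hypothetical condensation, not the exact method committed to WebAppUtils, though it only uses existing APIs.

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.HAUtil;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.webapp.util.WebAppUtils;

public final class RmWebAppUrlsSketch {
  /** Build one candidate RM webapp address per entry in ha.rm-ids. */
  public static List<String> getAllRMWebAppURLsWithoutScheme(
      Configuration conf) {
    List<String> urls = new ArrayList<>();
    for (String rmId : HAUtil.getRMHAIds(conf)) {
      Configuration rmConf = new YarnConfiguration(conf);
      rmConf.set(YarnConfiguration.RM_HA_ID, rmId);
      urls.add(WebAppUtils.getRMWebAppURLWithoutScheme(rmConf));
    }
    return urls;
  }

  private RmWebAppUrlsSketch() {
  }
}
{code}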
[jira] [Commented] (YARN-9581) Fix WebAppUtils#getRMWebAppURLWithScheme ignores rm2
[ https://issues.apache.org/jira/browse/YARN-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875156#comment-16875156 ] Hudson commented on YARN-9581: -- FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16836 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16836/]) YARN-9581. Add support for get multiple RM webapp URLs.(eyang: rev f02b0e19940dc6fc1e19258a40db37d1eed89d21) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/webapp/util/WebAppUtils.java > Fix WebAppUtils#getRMWebAppURLWithScheme ignores rm2 > > > Key: YARN-9581 > URL: https://issues.apache.org/jira/browse/YARN-9581 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9581-001.patch, YARN-9581-002.patch, > YARN-9581-003.patch, YARN-9581-004.patch, YARN-9581-005.patch, > YARN-9581-006.patch, YARN-9581-007.patch, YARN-9581.addendum-001.patch > > > Yarn Logs fails for a running job in case of RM HA with rm2 active and rm1 is > down. > {code} > hrt_qa@prabhuYarn:~> /usr/hdp/current/hadoop-yarn-client/bin/yarn logs > -applicationId application_1558613472348_0004 -am 1 > 19/05/24 18:04:49 INFO client.AHSProxy: Connecting to Application History > server at prabhuYarn/172.27.23.55:10200 > 19/05/24 18:04:50 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Unable to get AM container informations for the > application:application_1558613472348_0004 > java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > Error while authenticating with endpoint: > https://prabhuYarn:8090/ws/v1/cluster/apps/application_1558613472348_0004/appattempts > Can not get AMContainers logs for the > application:application_1558613472348_0004 with the appOwner:hrt_qa > {code} > LogsCli getRMWebAppURLWithoutScheme only checks the first one from the RM > list yarn.resourcemanager.ha.rm-ids. > {code} > yarnConfig.set(YarnConfiguration.RM_HA_ID, rmIds.get(0)); > {code} > SchedConfCli also fails > {code} > [ambari-qa@pjosephdocker-3 ~]$ yarn schedulerconf -update > root.default:maximum-capacity=90 > Exception in thread "main" com.sun.jersey.api.client.ClientHandlerException: > java.net.ConnectException: Connection refused (Connection refused) > at > com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:155) > at com.sun.jersey.api.client.Client.handle(Client.java:652) > at com.sun.jersey.api.client.WebResource.handle(WebResource.java:682) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9581) Fix WebAppUtils#getRMWebAppURLWithScheme ignores rm2
[ https://issues.apache.org/jira/browse/YARN-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875153#comment-16875153 ] Eric Yang commented on YARN-9581: - +1 for addendum patch 001. > Fix WebAppUtils#getRMWebAppURLWithScheme ignores rm2 > > > Key: YARN-9581 > URL: https://issues.apache.org/jira/browse/YARN-9581 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9581-001.patch, YARN-9581-002.patch, > YARN-9581-003.patch, YARN-9581-004.patch, YARN-9581-005.patch, > YARN-9581-006.patch, YARN-9581-007.patch, YARN-9581.addendum-001.patch > > > Yarn Logs fails for a running job in case of RM HA with rm2 active and rm1 is > down. > {code} > hrt_qa@prabhuYarn:~> /usr/hdp/current/hadoop-yarn-client/bin/yarn logs > -applicationId application_1558613472348_0004 -am 1 > 19/05/24 18:04:49 INFO client.AHSProxy: Connecting to Application History > server at prabhuYarn/172.27.23.55:10200 > 19/05/24 18:04:50 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Unable to get AM container informations for the > application:application_1558613472348_0004 > java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > Error while authenticating with endpoint: > https://prabhuYarn:8090/ws/v1/cluster/apps/application_1558613472348_0004/appattempts > Can not get AMContainers logs for the > application:application_1558613472348_0004 with the appOwner:hrt_qa > {code} > LogsCli getRMWebAppURLWithoutScheme only checks the first one from the RM > list yarn.resourcemanager.ha.rm-ids. > {code} > yarnConfig.set(YarnConfiguration.RM_HA_ID, rmIds.get(0)); > {code} > SchedConfCli also fails > {code} > [ambari-qa@pjosephdocker-3 ~]$ yarn schedulerconf -update > root.default:maximum-capacity=90 > Exception in thread "main" com.sun.jersey.api.client.ClientHandlerException: > java.net.ConnectException: Connection refused (Connection refused) > at > com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:155) > at com.sun.jersey.api.client.Client.handle(Client.java:652) > at com.sun.jersey.api.client.WebResource.handle(WebResource.java:682) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875124#comment-16875124 ] Jim Brennan commented on YARN-9560: --- Thanks for all the updates [~ebadger]! I am also +1 on patch 013 (non-binding). > Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime > --- > > Key: YARN-9560 > URL: https://issues.apache.org/jira/browse/YARN-9560 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Attachments: YARN-9560.001.patch, YARN-9560.002.patch, > YARN-9560.003.patch, YARN-9560.004.patch, YARN-9560.005.patch, > YARN-9560.006.patch, YARN-9560.007.patch, YARN-9560.008.patch, > YARN-9560.009.patch, YARN-9560.010.patch, YARN-9560.011.patch, > YARN-9560.012.patch, YARN-9560.013.patch > > > Since the new RuncContainerRuntime will be using a lot of the same code as > DockerLinuxContainerRuntime, it would be good to move a bunch of the > DockerLinuxContainerRuntime code up a level to an abstract class that both of > the runtimes can extend. > The new structure will look like: > {noformat} > OCIContainerRuntime (abstract class) > - DockerLinuxContainerRuntime > - RuncContainerRuntime > {noformat} > This JIRA should only change the structure of the code, not the actual > semantics -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9656) Plugin to avoid scheduling jobs on node which are not in "schedulable" state, but are healthy otherwise.
[ https://issues.apache.org/jira/browse/YARN-9656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Golash updated YARN-9656: -- Affects Version/s: 2.9.1 > Plugin to avoid scheduling jobs on node which are not in "schedulable" state, > but are healthy otherwise. > > > Key: YARN-9656 > URL: https://issues.apache.org/jira/browse/YARN-9656 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager, resourcemanager >Affects Versions: 2.9.1, 3.1.2 >Reporter: Prashant Golash >Priority: Major > > Creating this JIRA to get ideas from the community on whether this is > something helpful that can be done in YARN. Sometimes nodes go into a bad > state, e.g. hardware problems (bad I/O, fan failure). In other scenarios, if > CGroups are not enabled, nodes may be running very high on CPU and the jobs > scheduled on them will suffer. > > The idea is three-fold: > # Gather relevant metrics from node managers and publish them in some form > (e.g. an exclude file). > # The RM loads the file and puts those nodes on a blacklist. > # Once a node becomes good again, it can be put back on the whitelist. > Various optimizations can be done here, but I would like to understand > whether this could be helpful as an upstream feature in YARN. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9656) Plugin to avoid scheduling jobs on node which are not in "schedulable" state, but are healthy otherwise.
Prashant Golash created YARN-9656: - Summary: Plugin to avoid scheduling jobs on node which are not in "schedulable" state, but are healthy otherwise. Key: YARN-9656 URL: https://issues.apache.org/jira/browse/YARN-9656 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager, resourcemanager Affects Versions: 3.1.2 Reporter: Prashant Golash Creating this JIRA to get ideas from the community on whether this is something helpful that can be done in YARN. Sometimes nodes go into a bad state, e.g. hardware problems (bad I/O, fan failure). In other scenarios, if CGroups are not enabled, nodes may be running very high on CPU and the jobs scheduled on them will suffer. The idea is three-fold: # Gather relevant metrics from node managers and publish them in some form (e.g. an exclude file). # The RM loads the file and puts those nodes on a blacklist. # Once a node becomes good again, it can be put back on the whitelist. Various optimizations can be done here, but I would like to understand whether this could be helpful as an upstream feature in YARN. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
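To make the proposal concrete, a purely illustrative sketch of steps 1-2 follows: a periodic check classifies each node as schedulable or not and maintains an exclude file that the RM could consume through its existing node-exclusion path. The threshold, the metric source, and the file format are all invented for the example.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;

public class SchedulableNodeTrackerSketch {
  private final Set<String> blacklisted = new TreeSet<>();
  private final Path excludeFile;

  public SchedulableNodeTrackerSketch(String excludeFilePath) {
    this.excludeFile = Paths.get(excludeFilePath);
  }

  /** Called per node with whatever health metric the deployment gathers. */
  public synchronized void update(String nodeHostname, double cpuUtilization)
      throws IOException {
    boolean schedulable = cpuUtilization < 0.95; // assumed threshold
    boolean changed = schedulable
        ? blacklisted.remove(nodeHostname)
        : blacklisted.add(nodeHostname);
    if (changed) {
      // The RM would pick this up via its usual exclude-file refresh.
      Files.write(excludeFile, blacklisted);
    }
  }
}
{code}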
[jira] [Commented] (YARN-6740) Federation Router (hiding multiple RMs for ApplicationClientProtocol) phase 2
[ https://issues.apache.org/jira/browse/YARN-6740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875094#comment-16875094 ] Giovanni Matteo Fumarola commented on YARN-6740: Only 6-7 methods have been implemented across the several JIRAs; there are still methods that need to be implemented. > Federation Router (hiding multiple RMs for ApplicationClientProtocol) phase 2 > - > > Key: YARN-6740 > URL: https://issues.apache.org/jira/browse/YARN-6740 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Giovanni Matteo Fumarola >Assignee: Abhishek Modi >Priority: Major > > This JIRA tracks the implementation of the layer for routing > ApplicationClientProtocol requests to the appropriate RM(s) in a federated > YARN cluster. > Under YARN-3659 we only implemented getNewApplication, submitApplication, > forceKillApplication and getApplicationReport to execute applications E2E. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875089#comment-16875089 ] Shane Kumpf commented on YARN-9560: --- Thanks for the patch and explanation, [~ebadger]. It is a similar pattern to what we do in the delegating runtime. I tested out the patch and it looks good to me. The unit test failing looks to be unrelated. I'm +1 on patch 013. > Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime > --- > > Key: YARN-9560 > URL: https://issues.apache.org/jira/browse/YARN-9560 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Attachments: YARN-9560.001.patch, YARN-9560.002.patch, > YARN-9560.003.patch, YARN-9560.004.patch, YARN-9560.005.patch, > YARN-9560.006.patch, YARN-9560.007.patch, YARN-9560.008.patch, > YARN-9560.009.patch, YARN-9560.010.patch, YARN-9560.011.patch, > YARN-9560.012.patch, YARN-9560.013.patch > > > Since the new RuncContainerRuntime will be using a lot of the same code as > DockerLinuxContainerRuntime, it would be good to move a bunch of the > DockerLinuxContainerRuntime code up a level to an abstract class that both of > the runtimes can extend. > The new structure will look like: > {noformat} > OCIContainerRuntime (abstract class) > - DockerLinuxContainerRuntime > - RuncContainerRuntime > {noformat} > This JIRA should only change the structure of the code, not the actual > semantics -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875081#comment-16875081 ] Hadoop QA commented on YARN-9560: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 5s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 51s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s{color} | {color:green} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 0 new + 22 unchanged - 2 fixed = 22 total (was 24) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 31s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 6s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 21m 4s{color} | {color:red} hadoop-yarn-server-nodemanager in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 72m 5s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | YARN-9560 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12973188/YARN-9560.013.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 8d2acf393598 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / cbae241 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_212 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-YARN-Build/24334/artifact/out/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/24334/testReport/ | | Max. process+thread count | 413 (vs. ulimit of 1) | | modules | C:
[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875040#comment-16875040 ] Hudson commented on YARN-9655: -- FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16833 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16833/]) Revert "YARN-9655. AllocateResponse in FederationInterceptor lost (wwei: rev f09c31a97e1646a1089e87d859040ebfe0c047f5) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java > AllocateResponse in FederationInterceptor lost applicationPriority > --- > > Key: YARN-9655 > URL: https://issues.apache.org/jira/browse/YARN-9655 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > > In YARN Federation mode using FederationInterceptor, when submitting > application, am will report an error. > {code:java} > 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > java.lang.NullPointerException at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280) > at java.lang.Thread.run(Thread.java:748) > {code} > The reason is that applicationPriority is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875037#comment-16875037 ] Weiwei Yang commented on YARN-9655: --- Oops. The fix was simple and that made me overlook that there is no UT for it; let me revert the commit for now. [~hunhun], can you help add a UT to cover this NPE issue? Thanks > AllocateResponse in FederationInterceptor lost applicationPriority > --- > > Key: YARN-9655 > URL: https://issues.apache.org/jira/browse/YARN-9655 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > > In YARN Federation mode using FederationInterceptor, when submitting > application, am will report an error. > {code:java} > 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > java.lang.NullPointerException at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280) > at java.lang.Thread.run(Thread.java:748) > {code} > The reason is that applicationPriority is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875035#comment-16875035 ] Weiwei Yang commented on YARN-9655: --- +1. committing shortly. > AllocateResponse in FederationInterceptor lost applicationPriority > --- > > Key: YARN-9655 > URL: https://issues.apache.org/jira/browse/YARN-9655 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > > In YARN Federation mode using FederationInterceptor, when submitting > application, am will report an error. > {code:java} > 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > java.lang.NullPointerException at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280) > at java.lang.Thread.run(Thread.java:748) > {code} > The reason is that applicationPriority is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875030#comment-16875030 ] Hudson commented on YARN-9655: -- FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16832 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16832/]) YARN-9655. AllocateResponse in FederationInterceptor lost (wwei: rev 5e7caf128719aac7d16d0efc8334b3b5a4b01e89) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/amrmproxy/FederationInterceptor.java > AllocateResponse in FederationInterceptor lost applicationPriority > --- > > Key: YARN-9655 > URL: https://issues.apache.org/jira/browse/YARN-9655 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > > In YARN Federation mode using FederationInterceptor, when submitting > application, am will report an error. > {code:java} > 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > java.lang.NullPointerException at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280) > at java.lang.Thread.run(Thread.java:748) > {code} > The reason is that applicationPriority is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875020#comment-16875020 ] Craig Condit commented on YARN-9560: +1 on patch 013 (non-binding). > Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime > --- > > Key: YARN-9560 > URL: https://issues.apache.org/jira/browse/YARN-9560 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Attachments: YARN-9560.001.patch, YARN-9560.002.patch, > YARN-9560.003.patch, YARN-9560.004.patch, YARN-9560.005.patch, > YARN-9560.006.patch, YARN-9560.007.patch, YARN-9560.008.patch, > YARN-9560.009.patch, YARN-9560.010.patch, YARN-9560.011.patch, > YARN-9560.012.patch, YARN-9560.013.patch > > > Since the new RuncContainerRuntime will be using a lot of the same code as > DockerLinuxContainerRuntime, it would be good to move a bunch of the > DockerLinuxContainerRuntime code up a level to an abstract class that both of > the runtimes can extend. > The new structure will look like: > {noformat} > OCIContainerRuntime (abstract class) > - DockerLinuxContainerRuntime > - RuncContainerRuntime > {noformat} > This JIRA should only change the structure of the code, not the actual > semantics -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875018#comment-16875018 ] Eric Badger commented on YARN-9560: --- Patch 013 addresses checkstyle in the meantime > Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime > --- > > Key: YARN-9560 > URL: https://issues.apache.org/jira/browse/YARN-9560 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Attachments: YARN-9560.001.patch, YARN-9560.002.patch, > YARN-9560.003.patch, YARN-9560.004.patch, YARN-9560.005.patch, > YARN-9560.006.patch, YARN-9560.007.patch, YARN-9560.008.patch, > YARN-9560.009.patch, YARN-9560.010.patch, YARN-9560.011.patch, > YARN-9560.012.patch, YARN-9560.013.patch > > > Since the new RuncContainerRuntime will be using a lot of the same code as > DockerLinuxContainerRuntime, it would be good to move a bunch of the > DockerLinuxContainerRuntime code up a level to an abstract class that both of > the runtimes can extend. > The new structure will look like: > {noformat} > OCIContainerRuntime (abstract class) > - DockerLinuxContainerRuntime > - RuncContainerRuntime > {noformat} > This JIRA should only change the structure of the code, not the actual > semantics -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Badger updated YARN-9560: -- Attachment: YARN-9560.013.patch > Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime > --- > > Key: YARN-9560 > URL: https://issues.apache.org/jira/browse/YARN-9560 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Attachments: YARN-9560.001.patch, YARN-9560.002.patch, > YARN-9560.003.patch, YARN-9560.004.patch, YARN-9560.005.patch, > YARN-9560.006.patch, YARN-9560.007.patch, YARN-9560.008.patch, > YARN-9560.009.patch, YARN-9560.010.patch, YARN-9560.011.patch, > YARN-9560.012.patch, YARN-9560.013.patch > > > Since the new RuncContainerRuntime will be using a lot of the same code as > DockerLinuxContainerRuntime, it would be good to move a bunch of the > DockerLinuxContainerRuntime code up a level to an abstract class that both of > the runtimes can extend. > The new structure will look like: > {noformat} > OCIContainerRuntime (abstract class) > - DockerLinuxContainerRuntime > - RuncContainerRuntime > {noformat} > This JIRA should only change the structure of the code, not the actual > semantics -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9560) Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime
[ https://issues.apache.org/jira/browse/YARN-9560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875010#comment-16875010 ] Eric Badger commented on YARN-9560: --- The reason for having the {{isDockerContainerRequested()}} in {{OCIContainerRuntime}} is so that we can tell if either a Docker container _or_ a Runc container was requested. The classes that call {{isOCICompliantContainerRequested()}} are calling it from a static context. They don't know whether the container is docker, runc, or something else. For now since there are only Docker containers, the logic of {{isDockerContainerRequested()}} is identical to {{isOCICompliantContainerRequested()}}. However, once Runc is added, the logic of {{isOCICompliantContainerRequested()}} will be a logical OR of {{isDockerContainerRequested()}} and {{isRuncContainerRequested()}}. I thought this would be cleaner than changing all of the invocations of {{isDockerContainerRequested()}} and making those logical ORs. And to me it makes more sense to let the subclasses define the logic around whether a docker container or runc container is requested. If you have a better idea, let me know. > Restructure DockerLinuxContainerRuntime to extend a new OCIContainerRuntime > --- > > Key: YARN-9560 > URL: https://issues.apache.org/jira/browse/YARN-9560 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Attachments: YARN-9560.001.patch, YARN-9560.002.patch, > YARN-9560.003.patch, YARN-9560.004.patch, YARN-9560.005.patch, > YARN-9560.006.patch, YARN-9560.007.patch, YARN-9560.008.patch, > YARN-9560.009.patch, YARN-9560.010.patch, YARN-9560.011.patch, > YARN-9560.012.patch > > > Since the new RuncContainerRuntime will be using a lot of the same code as > DockerLinuxContainerRuntime, it would be good to move a bunch of the > DockerLinuxContainerRuntime code up a level to an abstract class that both of > the runtimes can extend. > The new structure will look like: > {noformat} > OCIContainerRuntime (abstract class) > - DockerLinuxContainerRuntime > - RuncContainerRuntime > {noformat} > This JIRA should only change the structure of the code, not the actual > semantics -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
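Concretely, the planned dispatch reads like the sketch below. The stub classes and the simplified signatures are placeholders (the real checks inspect the container launch environment and configuration), but the disjunction is the point: today the OCI check reduces to the Docker check, and once RuncContainerRuntime lands it becomes a logical OR.

{code:java}
import java.util.Map;

// Hypothetical stubs standing in for the static checks on the concrete
// runtimes. YARN_CONTAINER_RUNTIME_TYPE is the real env var consulted.
class DockerRuntimeCheck {
  static boolean isDockerContainerRequested(Map<String, String> env) {
    return "docker".equals(env.get("YARN_CONTAINER_RUNTIME_TYPE"));
  }
}

class RuncRuntimeCheck {
  static boolean isRuncContainerRequested(Map<String, String> env) {
    return "runc".equals(env.get("YARN_CONTAINER_RUNTIME_TYPE"));
  }
}

public class OCIDispatchSketch {
  // Static callers only need to know that *some* OCI-compliant runtime was
  // requested, not which one.
  static boolean isOCICompliantContainerRequested(Map<String, String> env) {
    return DockerRuntimeCheck.isDockerContainerRequested(env)
        || RuncRuntimeCheck.isRuncContainerRequested(env);
  }
}
{code}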
[jira] [Assigned] (YARN-9655) AllocateResponse in FederationInterceptor lost applicationPriority
[ https://issues.apache.org/jira/browse/YARN-9655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang reassigned YARN-9655: - Assignee: hunshenshi > AllocateResponse in FederationInterceptor lost applicationPriority > --- > > Key: YARN-9655 > URL: https://issues.apache.org/jira/browse/YARN-9655 > Project: Hadoop YARN > Issue Type: Bug > Components: federation >Affects Versions: 3.2.0 >Reporter: hunshenshi >Assignee: hunshenshi >Priority: Major > > In YARN Federation mode using FederationInterceptor, when submitting > application, am will report an error. > {code:java} > 2019-06-25 11:44:00,977 ERROR [RMCommunicator Allocator] > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator: ERROR IN CONTACTING RM. > java.lang.NullPointerException at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.handleJobPriorityChange(RMContainerAllocator.java:1025) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.getResources(RMContainerAllocator.java:880) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:286) > at > org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$AllocatorRunnable.run(RMCommunicator.java:280) > at java.lang.Thread.run(Thread.java:748) > {code} > The reason is that applicationPriority is lost. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874994#comment-16874994 ] Weiwei Yang commented on YARN-9623: --- Pushed to trunk, thanks for the contribution [~Tao Yang]. > Auto adjust max queue length of app activities to make sure activities on all > nodes can be covered > -- > > Key: YARN-9623 > URL: https://issues.apache.org/jira/browse/YARN-9623 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9623.001.patch, YARN-9623.002.patch > > > Currently we can use configuration entry > "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to > control max queue length of app activities, but in some scenarios, this > configuration may need to be updated in a growing cluster. Moreover, it's > better for users to ignore that conf, therefore it should be auto-adjusted > internally. > There are some differences among different scheduling modes: > * multi-node placement disabled > ** Heartbeat driven scheduling: max queue length of app activities should > not be less than the number of nodes; considering nodes cannot always be in > order, we should make some room for misordering, for example, we can guarantee > that max queue length should not be less than 1.2 * numNodes > ** Async scheduling: every async scheduling thread goes through all nodes in > order, in this mode, we should guarantee that max queue length should be > numThreads * numNodes. > * multi-node placement enabled: activities on all nodes can be involved in a > single app allocation, therefore there's no need to adjust for this mode. > To sum up, we can adjust the max queue length of app activities like this: > {code} > int configuredMaxQueueLength; > int maxQueueLength; > serviceInit(){ > ... > configuredMaxQueueLength = ...; //read configured max queue length > maxQueueLength = configuredMaxQueueLength; //take configured value as > default > } > CleanupThread#run(){ > ... > if (multiNodeDisabled) { > if (asyncSchedulingEnabled) { >maxQueueLength = max(configuredMaxQueueLength, numSchedulingThreads * > numNodes); > } else { >maxQueueLength = max(configuredMaxQueueLength, 1.2 * numNodes); > } > } else if (maxQueueLength != configuredMaxQueueLength) { > maxQueueLength = configuredMaxQueueLength; > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
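For reference, the pseudocode in the description boils down to the runnable form below. The 1.2 headroom factor and the per-mode split are taken from the description; the ActivitiesManager plumbing around it is omitted, and returning the configured value in the multi-node branch is equivalent to the pseudocode's reset.

{code:java}
public class AppActivitiesQueueLimitSketch {
  static int adjustedMaxQueueLength(int configuredMaxQueueLength,
      boolean multiNodePlacementEnabled, boolean asyncSchedulingEnabled,
      int numSchedulingThreads, int numNodes) {
    if (!multiNodePlacementEnabled) {
      if (asyncSchedulingEnabled) {
        // Every async scheduling thread walks all nodes in order.
        return Math.max(configuredMaxQueueLength,
            numSchedulingThreads * numNodes);
      }
      // Heartbeat-driven: leave ~20% headroom for out-of-order heartbeats.
      return Math.max(configuredMaxQueueLength, (int) (1.2 * numNodes));
    }
    // Multi-node placement: a single allocation already covers all nodes.
    return configuredMaxQueueLength;
  }
}
{code}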
[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16874995#comment-16874995 ] Hudson commented on YARN-9623: -- FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #16831 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/16831/]) YARN-9623. Auto adjust max queue length of app activities to make sure (wwei: rev cbae2413201bc470b5f16421ea69d1cd9edb64a8) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/TestActivitiesManager.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java > Auto adjust max queue length of app activities to make sure activities on all > nodes can be covered > -- > > Key: YARN-9623 > URL: https://issues.apache.org/jira/browse/YARN-9623 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Fix For: 3.3.0 > > Attachments: YARN-9623.001.patch, YARN-9623.002.patch > > > Currently we can use configuration entry > "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to > control max queue length of app activities, but in some scenarios, this > configuration may need to be updated in a growing cluster. Moreover, it's > better for users to ignore that conf, therefore it should be auto-adjusted > internally. > There are some differences among different scheduling modes: > * multi-node placement disabled > ** Heartbeat driven scheduling: max queue length of app activities should > not be less than the number of nodes; considering nodes cannot always be in > order, we should make some room for misordering, for example, we can guarantee > that max queue length should not be less than 1.2 * numNodes > ** Async scheduling: every async scheduling thread goes through all nodes in > order, in this mode, we should guarantee that max queue length should be > numThreads * numNodes. > * multi-node placement enabled: activities on all nodes can be involved in a > single app allocation, therefore there's no need to adjust for this mode. > To sum up, we can adjust the max queue length of app activities like this: > {code} > int configuredMaxQueueLength; > int maxQueueLength; > serviceInit(){ > ... > configuredMaxQueueLength = ...; //read configured max queue length > maxQueueLength = configuredMaxQueueLength; //take configured value as > default > } > CleanupThread#run(){ > ... > if (multiNodeDisabled) { > if (asyncSchedulingEnabled) { >maxQueueLength = max(configuredMaxQueueLength, numSchedulingThreads * > numNodes); > } else { >maxQueueLength = max(configuredMaxQueueLength, 1.2 * numNodes); > } > } else if (maxQueueLength != configuredMaxQueueLength) { > maxQueueLength = configuredMaxQueueLength; > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874991#comment-16874991 ] Weiwei Yang commented on YARN-9623: --- +1, committing shortly. > Auto adjust max queue length of app activities to make sure activities on all > nodes can be covered > -- > > Key: YARN-9623 > URL: https://issues.apache.org/jira/browse/YARN-9623 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9623.001.patch, YARN-9623.002.patch > > > Currently we can use the configuration entry > "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to > control the max queue length of app activities, but in some scenarios this > configuration may need to be updated as a cluster grows. Moreover, it's better > for users to be able to ignore that conf, therefore it should be auto-adjusted > internally. > There are some differences among the scheduling modes: > * multi-node placement disabled > ** Heartbeat-driven scheduling: the max queue length of app activities should > not be less than the number of nodes; since nodes cannot always be in order, > we should leave some room for misordering, for example, we can guarantee that > the max queue length is not less than 1.2 * numNodes > ** Async scheduling: every async scheduling thread goes through all nodes in > order, so in this mode we should guarantee that the max queue length is > numThreads * numNodes. > * multi-node placement enabled: activities on all nodes can be involved in a > single app allocation, therefore there's no need to adjust for this mode. > To sum up, we can adjust the max queue length of app activities like this: > {code} > int configuredMaxQueueLength; > int maxQueueLength; > serviceInit(){ > ... > configuredMaxQueueLength = ...; // read configured max queue length > maxQueueLength = configuredMaxQueueLength; // take configured value as default > } > CleanupThread#run(){ > ... > if (multiNodeDisabled) { > if (asyncSchedulingEnabled) { > maxQueueLength = max(configuredMaxQueueLength, numSchedulingThreads * numNodes); > } else { > maxQueueLength = max(configuredMaxQueueLength, 1.2 * numNodes); > } > } else if (maxQueueLength != configuredMaxQueueLength) { > maxQueueLength = configuredMaxQueueLength; > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874917#comment-16874917 ] Hadoop QA commented on YARN-9623: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 5m 12s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 15s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 8s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 44s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 42s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 6s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 38s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 48s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 56s{color} | {color:green} hadoop-yarn-api in the patch passed. {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 3m 46s{color} | {color:green} hadoop-yarn-common in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 84m 48s{color} | {color:red} hadoop-yarn-server-resourcemanager in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 42s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}176m 56s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:bdbca0e | | JIRA Issue | YARN-9623 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12973152/YARN-9623.002.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml | | uname | Linux 58c6b042f9e0 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC
[jira] [Commented] (YARN-9581) Fix WebAppUtils#getRMWebAppURLWithScheme ignores rm2
[ https://issues.apache.org/jira/browse/YARN-9581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874881#comment-16874881 ] Prabhu Joseph commented on YARN-9581: - [~eyang] This Jira fixes the two-RM HA case. I have submitted an addendum patch to handle multiple RMs. Could you review [^YARN-9581.addendum-001.patch] when you get time? > Fix WebAppUtils#getRMWebAppURLWithScheme ignores rm2 > > > Key: YARN-9581 > URL: https://issues.apache.org/jira/browse/YARN-9581 > Project: Hadoop YARN > Issue Type: Bug > Components: client >Affects Versions: 3.2.0 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Fix For: 3.3.0, 3.2.1 > > Attachments: YARN-9581-001.patch, YARN-9581-002.patch, > YARN-9581-003.patch, YARN-9581-004.patch, YARN-9581-005.patch, > YARN-9581-006.patch, YARN-9581-007.patch, YARN-9581.addendum-001.patch > > > Yarn logs fails for a running job in an RM HA setup with rm2 active and rm1 > down. > {code} > hrt_qa@prabhuYarn:~> /usr/hdp/current/hadoop-yarn-client/bin/yarn logs > -applicationId application_1558613472348_0004 -am 1 > 19/05/24 18:04:49 INFO client.AHSProxy: Connecting to Application History > server at prabhuYarn/172.27.23.55:10200 > 19/05/24 18:04:50 INFO client.ConfiguredRMFailoverProxyProvider: Failing over > to rm2 > Unable to get AM container informations for the > application:application_1558613472348_0004 > java.io.IOException: > org.apache.hadoop.security.authentication.client.AuthenticationException: > Error while authenticating with endpoint: > https://prabhuYarn:8090/ws/v1/cluster/apps/application_1558613472348_0004/appattempts > Can not get AMContainers logs for the > application:application_1558613472348_0004 with the appOwner:hrt_qa > {code} > LogsCLI getRMWebAppURLWithoutScheme only checks the first entry of the RM > list yarn.resourcemanager.ha.rm-ids. > {code} > yarnConfig.set(YarnConfiguration.RM_HA_ID, rmIds.get(0)); > {code} > SchedConfCLI also fails > {code} > [ambari-qa@pjosephdocker-3 ~]$ yarn schedulerconf -update > root.default:maximum-capacity=90 > Exception in thread "main" com.sun.jersey.api.client.ClientHandlerException: > java.net.ConnectException: Connection refused (Connection refused) > at > com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:155) > at com.sun.jersey.api.client.Client.handle(Client.java:652) > at com.sun.jersey.api.client.WebResource.handle(WebResource.java:682) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
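To make the failover idea concrete, here is a standalone sketch: a hypothetical RmWebAppResolver helper, not the actual WebAppUtils/LogsCLI change, that probes every configured RM web address instead of stopping at the first entry:
{code:java}
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

// Illustrative only: shows the try-each-RM-until-reachable pattern behind the
// fix. The real addendum works against yarn.resourcemanager.ha.rm-ids.
public class RmWebAppResolver {

  /**
   * Tries each RM web address in turn and returns the first one that
   * answers, instead of always using the first entry (the reported bug).
   */
  public static String pickActiveRmWebApp(List<String> rmWebAddresses)
      throws IOException {
    IOException last = null;
    for (String address : rmWebAddresses) {
      try {
        HttpURLConnection conn = (HttpURLConnection)
            new URL(address + "/ws/v1/cluster/info").openConnection();
        conn.setConnectTimeout(2000);
        conn.setReadTimeout(2000);
        if (conn.getResponseCode() == 200) {
          return address; // this RM is up (likely the active one)
        }
      } catch (IOException e) {
        last = e; // RM down or unreachable, try the next one
      }
    }
    throw new IOException("No reachable RM among " + rmWebAddresses, last);
  }
}
{code}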
[jira] [Assigned] (YARN-9625) UI2 - No link to a queue on the Queues page for Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-9625: Assignee: Zoltan Siegl > UI2 - No link to a queue on the Queues page for Fair Scheduler > -- > > Key: YARN-9625 > URL: https://issues.apache.org/jira/browse/YARN-9625 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Charan Hebri >Assignee: Zoltan Siegl >Priority: Major > Attachments: Capacity_scheduler_page.png, Fair_scheduler_page.png > > > When the scheduler is set to 'Capacity Scheduler', the Queues page has a tab > on the right with a link to each queue, which provides running app > information for that queue. For 'Fair Scheduler' there is no such link. > Screenshots for both schedulers are attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874840#comment-16874840 ] Tao Yang commented on YARN-7621: Attached the v2 patch, rebased onto trunk. [~cheersyang], could you please help review this patch? Thanks. > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference in the queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler: > FairScheduler needs the queue path while CapacityScheduler needs the queue > name. There is no doubt about the correctness of the queue definition for > CapacityScheduler, because it does not allow duplicate leaf queue names, but > it makes it hard to switch between FairScheduler and CapacityScheduler. I > propose to support submitting apps with a queue path for CapacityScheduler to > make the interface clearer and the scheduler switch smoother. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
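As a small illustration of the path-versus-name difference (a hypothetical helper, not part of the attached patches), this is how a full queue path maps onto the leaf queue name that CapacityScheduler expects today:
{code:java}
// Illustrative sketch: reduce a full queue path to its leaf name.
// CapacityScheduler can rely on this mapping only because it forbids
// duplicate leaf queue names.
public final class QueuePaths {

  /** "root.a.b" -> "b"; a bare leaf name like "b" is returned unchanged. */
  public static String leafName(String queuePath) {
    int lastDot = queuePath.lastIndexOf('.');
    return lastDot < 0 ? queuePath : queuePath.substring(lastDot + 1);
  }

  public static void main(String[] args) {
    System.out.println(leafName("root.engineering.spark")); // spark
    System.out.println(leafName("default"));                // default
  }
}
{code}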
[jira] [Updated] (YARN-7621) Support submitting apps with queue path for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-7621: --- Attachment: YARN-7621.002.patch > Support submitting apps with queue path for CapacityScheduler > - > > Key: YARN-7621 > URL: https://issues.apache.org/jira/browse/YARN-7621 > Project: Hadoop YARN > Issue Type: Improvement > Components: capacityscheduler >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Minor > Attachments: YARN-7621.001.patch, YARN-7621.002.patch > > > Currently there is a difference in the queue definition in > ApplicationSubmissionContext between CapacityScheduler and FairScheduler: > FairScheduler needs the queue path while CapacityScheduler needs the queue > name. There is no doubt about the correctness of the queue definition for > CapacityScheduler, because it does not allow duplicate leaf queue names, but > it makes it hard to switch between FairScheduler and CapacityScheduler. I > propose to support submitting apps with a queue path for CapacityScheduler to > make the interface clearer and the scheduler switch smoother. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874825#comment-16874825 ] Tao Yang commented on YARN-9623: Thanks [~cheersyang] for your comments. {quote} If this configuration is set, then the value should be enforced for the queue size and disable the auto-adjustment. Can you add that logic? {quote} Currently the configuration {{yarn.resourcemanager.activities-manager.app-activities.max-queue-length}} is still there and can be seen as a lower bound: the max queue length of app activities can only be updated to a value larger than it. I think this should make sense to us as well. Attached the v2 patch, which adds a volatile modifier to appActivitiesMaxQueueLength so that updates are seen by other threads as soon as possible. > Auto adjust max queue length of app activities to make sure activities on all > nodes can be covered > -- > > Key: YARN-9623 > URL: https://issues.apache.org/jira/browse/YARN-9623 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9623.001.patch, YARN-9623.002.patch > > > Currently we can use the configuration entry > "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to > control the max queue length of app activities, but in some scenarios this > configuration may need to be updated as a cluster grows. Moreover, it's better > for users to be able to ignore that conf, therefore it should be auto-adjusted > internally. > There are some differences among the scheduling modes: > * multi-node placement disabled > ** Heartbeat-driven scheduling: the max queue length of app activities should > not be less than the number of nodes; since nodes cannot always be in order, > we should leave some room for misordering, for example, we can guarantee that > the max queue length is not less than 1.2 * numNodes > ** Async scheduling: every async scheduling thread goes through all nodes in > order, so in this mode we should guarantee that the max queue length is > numThreads * numNodes. > * multi-node placement enabled: activities on all nodes can be involved in a > single app allocation, therefore there's no need to adjust for this mode. > To sum up, we can adjust the max queue length of app activities like this: > {code} > int configuredMaxQueueLength; > int maxQueueLength; > serviceInit(){ > ... > configuredMaxQueueLength = ...; // read configured max queue length > maxQueueLength = configuredMaxQueueLength; // take configured value as default > } > CleanupThread#run(){ > ... > if (multiNodeDisabled) { > if (asyncSchedulingEnabled) { > maxQueueLength = max(configuredMaxQueueLength, numSchedulingThreads * numNodes); > } else { > maxQueueLength = max(configuredMaxQueueLength, 1.2 * numNodes); > } > } else if (maxQueueLength != configuredMaxQueueLength) { > maxQueueLength = configuredMaxQueueLength; > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874001#comment-16874001 ] Tao Yang edited comment on YARN-9623 at 6/28/19 10:07 AM: -- Thanks [~cheersyang] for the feedback. {quote} However, the activity manager should be a general service, it should not be depending on CS's configuration. {quote} Yes, I had this concern before, but the required number of app activities is indeed decided by the specific scheduler, and even by the specific scheduling policy inside the scheduler. So the patch does the same as some general services like QueueACLsManager/SchedulerPlacementProcessor/... (using {{if scheduler instanceof CapacityScheduler}}). The specific scheduler can be ignored only if we always set maxQueueLength to max(configuredMaxQueueLength, 1.2 * numOfNodes), which may waste a lot in a large cluster with multi-node placement enabled. Thoughts? {quote} Another thing is appActivitiesMaxQueueLength, do we need to make it atomic because it is being modified in another thread. {quote} There is no need to make it atomic since there are no requirements for ordering or consistency, but volatile is necessary for this variable. was (Author: tao yang): Thanks [~cheersyang] for the feedback. {quote} However, the activity manager should be a general service, it should not be depending on CS's configuration. {quote} Yes, I had this concern before, but the required number of app activities is indeed decided by the specific scheduler, and even by the specific scheduling policy inside the scheduler. So the patch does the same as some general services like QueueACLsManager/SchedulerPlacementProcessor/... (using {{if scheduler instanceof CapacityScheduler}}). The specific scheduler can be ignored only if we always set maxQueueLength to max(configuredMaxQueueLength, 1.2 * numOfNodes), which may waste a lot in a large cluster with multi-node placement enabled. Thoughts? {quote} Another thing is appActivitiesMaxQueueLength, do we need to make it atomic because it is being modified in another thread. {quote} There is no need to make it atomic since there are no requirements for ordering or consistency, but violate is necessary for this variable. > Auto adjust max queue length of app activities to make sure activities on all > nodes can be covered > -- > > Key: YARN-9623 > URL: https://issues.apache.org/jira/browse/YARN-9623 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9623.001.patch, YARN-9623.002.patch > > > Currently we can use the configuration entry > "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to > control the max queue length of app activities, but in some scenarios this > configuration may need to be updated as a cluster grows. Moreover, it's better > for users to be able to ignore that conf, therefore it should be auto-adjusted > internally. > There are some differences among the scheduling modes: > * multi-node placement disabled > ** Heartbeat-driven scheduling: the max queue length of app activities should > not be less than the number of nodes; since nodes cannot always be in order, > we should leave some room for misordering, for example, we can guarantee that > the max queue length is not less than 1.2 * numNodes > ** Async scheduling: every async scheduling thread goes through all nodes in > order, so in this mode we should guarantee that the max queue length is > numThreads * numNodes. > * multi-node placement enabled: activities on all nodes can be involved in a > single app allocation, therefore there's no need to adjust for this mode. > To sum up, we can adjust the max queue length of app activities like this: > {code} > int configuredMaxQueueLength; > int maxQueueLength; > serviceInit(){ > ... > configuredMaxQueueLength = ...; // read configured max queue length > maxQueueLength = configuredMaxQueueLength; // take configured value as default > } > CleanupThread#run(){ > ... > if (multiNodeDisabled) { > if (asyncSchedulingEnabled) { > maxQueueLength = max(configuredMaxQueueLength, numSchedulingThreads * numNodes); > } else { > maxQueueLength = max(configuredMaxQueueLength, 1.2 * numNodes); > } > } else if (maxQueueLength != configuredMaxQueueLength) { > maxQueueLength = configuredMaxQueueLength; > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
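A minimal sketch of the volatile-versus-atomic point above, assuming a single writer thread as in the cleanup-thread design (field and method names are illustrative, not the actual ActivitiesManager fields):
{code:java}
// Illustrative sketch: with exactly one writer thread and no
// read-modify-write cycle, a volatile field is sufficient; an Atomic*
// type would add cost without adding safety.
public class VolatileVsAtomicDemo {

  private volatile int maxQueueLength = 100; // single writer, many readers

  void cleanupThreadTick(int numNodes) {
    // Plain write is safe: only the cleanup thread ever writes this field.
    maxQueueLength = Math.max(100, (int) (1.2 * numNodes));
  }

  boolean admits(int currentQueueSize) {
    // volatile guarantees readers promptly see the most recent write.
    return currentQueueSize < maxQueueLength;
  }
}
{code}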
[jira] [Updated] (YARN-9623) Auto adjust max queue length of app activities to make sure activities on all nodes can be covered
[ https://issues.apache.org/jira/browse/YARN-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9623: --- Attachment: YARN-9623.002.patch > Auto adjust max queue length of app activities to make sure activities on all > nodes can be covered > -- > > Key: YARN-9623 > URL: https://issues.apache.org/jira/browse/YARN-9623 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9623.001.patch, YARN-9623.002.patch > > > Currently we can use the configuration entry > "yarn.resourcemanager.activities-manager.app-activities.max-queue-length" to > control the max queue length of app activities, but in some scenarios this > configuration may need to be updated as a cluster grows. Moreover, it's better > for users to be able to ignore that conf, therefore it should be auto-adjusted > internally. > There are some differences among the scheduling modes: > * multi-node placement disabled > ** Heartbeat-driven scheduling: the max queue length of app activities should > not be less than the number of nodes; since nodes cannot always be in order, > we should leave some room for misordering, for example, we can guarantee that > the max queue length is not less than 1.2 * numNodes > ** Async scheduling: every async scheduling thread goes through all nodes in > order, so in this mode we should guarantee that the max queue length is > numThreads * numNodes. > * multi-node placement enabled: activities on all nodes can be involved in a > single app allocation, therefore there's no need to adjust for this mode. > To sum up, we can adjust the max queue length of app activities like this: > {code} > int configuredMaxQueueLength; > int maxQueueLength; > serviceInit(){ > ... > configuredMaxQueueLength = ...; // read configured max queue length > maxQueueLength = configuredMaxQueueLength; // take configured value as default > } > CleanupThread#run(){ > ... > if (multiNodeDisabled) { > if (asyncSchedulingEnabled) { > maxQueueLength = max(configuredMaxQueueLength, numSchedulingThreads * numNodes); > } else { > maxQueueLength = max(configuredMaxQueueLength, 1.2 * numNodes); > } > } else if (maxQueueLength != configuredMaxQueueLength) { > maxQueueLength = configuredMaxQueueLength; > } > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9625) UI2 - No link to a queue on the Queues page for Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-9625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-9625: Assignee: (was: Szilard Nemeth) > UI2 - No link to a queue on the Queues page for Fair Scheduler > -- > > Key: YARN-9625 > URL: https://issues.apache.org/jira/browse/YARN-9625 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Charan Hebri >Priority: Major > Attachments: Capacity_scheduler_page.png, Fair_scheduler_page.png > > > When the scheduler is set to 'Capacity Scheduler', the Queues page has a tab > on the right with a link to each queue, which provides running app > information for that queue. For 'Fair Scheduler' there is no such link. > Screenshots for both schedulers are attached. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9626) UI2 - Fair scheduler queue apps page issues
[ https://issues.apache.org/jira/browse/YARN-9626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth reassigned YARN-9626: Assignee: (was: Szilard Nemeth) > UI2 - Fair scheduler queue apps page issues > --- > > Key: YARN-9626 > URL: https://issues.apache.org/jira/browse/YARN-9626 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Charan Hebri >Priority: Major > Attachments: Fair_scheduler_apps_page.png > > > There are a few issues with the apps page for a queue when Fair Scheduler is > used. > * Labels like configured capacity, configured max capacity, etc. (marked in > the attached image) are not needed, as they are specific to Capacity > Scheduler. > * Steady fair memory, used memory and maximum memory are absolute values but > are shown as percentages. > * Formatting of the Pending, Allocated and Reserved Containers values is > incorrect (shown in the attached screenshot). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9640) Slow event processing could cause too many attempt unregister events
[ https://issues.apache.org/jira/browse/YARN-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874723#comment-16874723 ] Zhankun Tang commented on YARN-9640: [~bibinchundatt], yeah, agreed. > Slow event processing could cause too many attempt unregister events > > > Key: YARN-9640 > URL: https://issues.apache.org/jira/browse/YARN-9640 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Bibin A Chundatt >Assignee: Bibin A Chundatt >Priority: Critical > Labels: scalability > Attachments: YARN-9640.001.patch, YARN-9640.002.patch, > YARN-9640.003.patch > > > During verification on one of our test clusters, we found that the number of > attempt unregister events was about 300k+. > # All AM containers completed. > # AMRMClientImpl sends finishApplicationMaster. > # AMRMClient checks the finish status every 100ms using a > finishApplicationMaster request. > # AMRMClientImpl#unregisterApplicationMaster > {code:java} > while (true) { > FinishApplicationMasterResponse response = > rmClient.finishApplicationMaster(request); > if (response.getIsUnregistered()) { > break; > } > LOG.info("Waiting for application to be successfully unregistered."); > Thread.sleep(100); > } > {code} > # The ApplicationMasterService finishApplicationMaster interface sends an > unregister event on every status update. > We should send the unregister event only once and cache the fact that it was > sent, then ignore subsequent requests and return a not-yet-unregistered > response to the AM instead of overloading the event queue. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
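A sketch of the proposed once-only behavior (hypothetical names, not the actual ApplicationMasterService code): remember which attempts have already had an unregister event dispatched, and answer repeat polls without re-queueing:
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of server-side deduplication for unregister events.
public class UnregisterDeduper {

  private final Set<String> unregisterSent = ConcurrentHashMap.newKeySet();

  /**
   * Returns true if the caller should dispatch an unregister event,
   * false if one was already sent for this attempt (the AM just gets a
   * not-yet-unregistered response instead of a new event on the queue).
   */
  public boolean shouldDispatch(String appAttemptId) {
    // add() is atomic: only the first caller per attempt gets true.
    return unregisterSent.add(appAttemptId);
  }

  public void forget(String appAttemptId) {
    unregisterSent.remove(appAttemptId); // cleanup once the attempt is done
  }
}
{code}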
[jira] [Commented] (YARN-9480) createAppDir() in LogAggregationService shouldn't block dispatcher thread of ContainerManagerImpl
[ https://issues.apache.org/jira/browse/YARN-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874719#comment-16874719 ] liyakun commented on YARN-9480: --- [~tangzhankun] please help add [~Yunyao Zhang] as a contributor; he will work on this issue. > createAppDir() in LogAggregationService shouldn't block dispatcher thread of > ContainerManagerImpl > - > > Key: YARN-9480 > URL: https://issues.apache.org/jira/browse/YARN-9480 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: liyakun >Assignee: liyakun >Priority: Major > > At present, in startContainers(), if the NM does not contain the application, > it enters the INIT_APPLICATION step. In the application init step, > createAppDir() is executed, and it is a blocking operation. > createAppDir() needs to interact with an external file system, so it is > affected by that file system's SLA. Once the external file system has high > latency, the dispatcher thread of ContainerManagerImpl in the NM gets stuck. > (In fact, I have seen the NM stuck here for more than an hour.) > I think it would be more reasonable to move createAppDir() to the actual > log-upload time (in other threads). Also, according to the > logRetentionPolicy, many containers may never reach this step, which would > save a lot of interactions with the external file system. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
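As a rough illustration of the proposal (hypothetical standalone code, not the LogAggregationService patch), the directory creation can be handed to a worker pool so the dispatcher thread returns immediately:
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch: slow remote-filesystem work moved off the caller's
// thread. Paths and pool size are placeholder assumptions.
public class AsyncAppDirCreator {

  private final ExecutorService workers = Executors.newFixedThreadPool(4);

  /** Kicks off directory creation and returns immediately. */
  public CompletableFuture<Path> createAppDirAsync(String appId) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        // Stand-in for the real remote-FS call that can stall for a long time.
        return Files.createDirectories(Paths.get("/tmp/app-logs", appId));
      } catch (IOException e) {
        throw new RuntimeException("createAppDir failed for " + appId, e);
      }
    }, workers);
  }

  public void shutdown() {
    workers.shutdown();
  }
}
{code}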
[jira] [Commented] (YARN-9480) createAppDir() in LogAggregationService shouldn't block dispatcher thread of ContainerManagerImpl
[ https://issues.apache.org/jira/browse/YARN-9480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874717#comment-16874717 ] Yunyao Zhang commented on YARN-9480: Please assign this to me. > createAppDir() in LogAggregationService shouldn't block dispatcher thread of > ContainerManagerImpl > - > > Key: YARN-9480 > URL: https://issues.apache.org/jira/browse/YARN-9480 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager >Reporter: liyakun >Assignee: liyakun >Priority: Major > > At present, in startContainers(), if the NM does not contain the application, > it enters the INIT_APPLICATION step. In the application init step, > createAppDir() is executed, and it is a blocking operation. > createAppDir() needs to interact with an external file system, so it is > affected by that file system's SLA. Once the external file system has high > latency, the dispatcher thread of ContainerManagerImpl in the NM gets stuck. > (In fact, I have seen the NM stuck here for more than an hour.) > I think it would be more reasonable to move createAppDir() to the actual > log-upload time (in other threads). Also, according to the > logRetentionPolicy, many containers may never reach this step, which would > save a lot of interactions with the external file system. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org