[jira] [Commented] (YARN-8468) Enable the use of queue based maximum container allocation limit and implement it in FairScheduler
[ https://issues.apache.org/jira/browse/YARN-8468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651405#comment-16651405 ]

Weiwei Yang commented on YARN-8468:
---

Still the same issue on branch-3.1, but the recent trunk build seems fine. I gave up on committing this to branch-3.1 :(

> Enable the use of queue based maximum container allocation limit and implement it in FairScheduler
> --
>
> Key: YARN-8468
> URL: https://issues.apache.org/jira/browse/YARN-8468
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: fairscheduler, scheduler
> Affects Versions: 3.1.0
> Reporter: Antal Bálint Steinbach
> Assignee: Antal Bálint Steinbach
> Priority: Critical
> Fix For: 3.2.0
>
> Attachments: YARN-8468-branch-3.1.018.patch, YARN-8468-branch-3.1.019.patch, YARN-8468-branch-3.1.020.patch,
> YARN-8468-branch-3.1.021.patch, YARN-8468-branch-3.1.022.patch,
> YARN-8468.000.patch, YARN-8468.001.patch, YARN-8468.002.patch,
> YARN-8468.003.patch, YARN-8468.004.patch, YARN-8468.005.patch,
> YARN-8468.006.patch, YARN-8468.007.patch, YARN-8468.008.patch,
> YARN-8468.009.patch, YARN-8468.010.patch, YARN-8468.011.patch,
> YARN-8468.012.patch, YARN-8468.013.patch, YARN-8468.014.patch,
> YARN-8468.015.patch, YARN-8468.016.patch, YARN-8468.017.patch,
> YARN-8468.018.patch
>
>
> When using any scheduler, you can use "yarn.scheduler.maximum-allocation-mb" to limit the overall size of a container. This applies globally to all containers; it cannot be limited per queue and is not scheduler dependent.
> The goal of this ticket is to allow this value to be set on a per-queue basis.
> The use case: a user has two pools, one for ad hoc jobs and one for enterprise apps. The user wants to limit ad hoc jobs to small containers but allow enterprise apps to request as many resources as needed. yarn.scheduler.maximum-allocation-mb sets the default maximum container size for all queues, and the “maxContainerResources” queue config value sets the maximum resources per queue.
> Suggested solution:
> All the infrastructure is already in the code. We need to do the following:
> * add the setting to the queue properties for all queue types (parent and leaf), this will cover dynamically created queues.
> * if we set it on the root we override the scheduler setting and we should not allow that.
> * make sure that the queue resource cap cannot be larger than the scheduler max resource cap in the config.
> * implement getMaximumResourceCapability(String queueName) in the FairScheduler
> * implement getMaximumResourceCapability(String queueName) in both FSParentQueue and FSLeafQueue as follows
> * expose the setting in the queue information in the RM web UI.
> * expose the setting in the metrics etc for the queue.
> * Enforce the use of the queue based maximum allocation limit if it is available; if not, use the general scheduler level setting
> ** Use it during validation and normalization of requests in scheduler.allocate, app submit and resource request

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
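The lookup rule described in the suggested solution for YARN-8468 (use the queue-level cap when one is set, otherwise fall back to the scheduler-level setting, and never let a queue cap exceed the scheduler cap or be set on root) can be sketched roughly as below. This is a minimal illustration only, not the FairScheduler code: the class `QueueMaxAllocationSketch` and its methods are hypothetical names, memory is the only resource modelled, and inheriting a parent queue's cap for child queues is an assumption made to mirror the remark about dynamically created queues.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a per-queue maximum allocation lookup.
 * Not the FairScheduler implementation; all names are illustrative.
 */
final class QueueMaxAllocationSketch {
    /** Scheduler-wide cap, standing in for yarn.scheduler.maximum-allocation-mb. */
    private final long schedulerMaxMb;
    /** Per-queue caps keyed by full queue name, e.g. "root.adhoc". Absent means "not set". */
    private final Map<String, Long> queueMaxMb = new HashMap<>();

    QueueMaxAllocationSketch(long schedulerMaxMb) {
        if (schedulerMaxMb <= 0) {
            throw new IllegalArgumentException("scheduler maximum must be positive");
        }
        this.schedulerMaxMb = schedulerMaxMb;
    }

    /** Registers a queue-level cap, rejecting the two cases the ticket says to disallow. */
    void setQueueMaxMb(String queueName, long maxMb) {
        if ("root".equals(queueName)) {
            // A cap on root would effectively override the scheduler setting.
            throw new IllegalArgumentException("queue cap must not be set on root");
        }
        if (maxMb <= 0 || maxMb > schedulerMaxMb) {
            throw new IllegalArgumentException("queue cap must be in (0, " + schedulerMaxMb + "]");
        }
        queueMaxMb.put(queueName, maxMb);
    }

    /**
     * Returns the effective cap for a queue: its own cap, else the nearest
     * ancestor's cap (assumption), else the scheduler-wide cap.
     */
    long getMaximumAllocationMb(String queueName) {
        String name = queueName;
        while (name != null) {
            Long cap = queueMaxMb.get(name);
            if (cap != null) {
                return cap;
            }
            int dot = name.lastIndexOf('.');
            name = dot < 0 ? null : name.substring(0, dot);
        }
        return schedulerMaxMb;
    }

    /** Validation step: true when a request fits under the queue's effective cap. */
    boolean fits(String queueName, long requestedMb) {
        return requestedMb > 0 && requestedMb <= getMaximumAllocationMb(queueName);
    }
}
```

With a scheduler cap of 8192 MB and a 1024 MB cap on a hypothetical `root.adhoc` queue, a 4096 MB request would be rejected for `root.adhoc` but accepted for an uncapped `root.enterprise` queue, which matches the ad hoc versus enterprise use case in the description.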
[jira] [Commented] (YARN-8877) Extend service spec to allow setting resource attributes
[ https://issues.apache.org/jira/browse/YARN-8877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651500#comment-16651500 ] Hadoop QA commented on YARN-8877: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 28s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 4 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 7s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 9s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 10s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 17s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 48s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 7m 9s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 7m 9s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 1m 8s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn: The patch generated 1 new + 42 unchanged - 0 fixed = 43 total (was 42) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 6s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 26s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 41s{color} | {color:green} hadoop-yarn-api in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 13m 16s{color} | {color:green} hadoop-yarn-services-core in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 29s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 82m 18s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 | | JIRA Issue | YARN-8877 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12944076/YARN-8877.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux b79000dfed0a 4.4.0-133-generic #159-Ubuntu SMP Fri Aug 10 07:31:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 0bf8a11 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC1 | | checkstyle |
[jira] [Assigned] (YARN-7756) AMRMProxyService cann't enable ’hadoop.security.authorization‘
[ https://issues.apache.org/jira/browse/YARN-7756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

leiqiang reassigned YARN-7756:
--
    Assignee: leiqiang

> AMRMProxyService cann't enable ’hadoop.security.authorization‘
> --
>
> Key: YARN-7756
> URL: https://issues.apache.org/jira/browse/YARN-7756
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.9.0, 3.0.0
> Reporter: leiqiang
> Assignee: leiqiang
> Priority: Major
> Attachments: YARN-7756.v0.patch, YARN-7756.v1.patch
>
>
> After setting hadoop.security.authorization=true, starting AMRMProxyService fails with the following error:
> {quote}org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.security.authorize.AuthorizationException: Protocol interface org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB is not known.
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.security.authorize.AuthorizationException: Protocol interface org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB is not known.
> at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:177)
> at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.serviceStart(RMCommunicator.java:121)
> at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.serviceStart(RMContainerAllocator.java:250)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.serviceStart(MRAppMaster.java:844)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:1114)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$4.run(MRAppMaster.java:1529)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1803)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1525)
> at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1458)
> Caused by: org.apache.hadoop.security.authorize.AuthorizationException: Protocol interface org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB is not known.
> at sun.reflect.GeneratedConstructorAccessor14.newInstance(Unknown Source)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
> at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
> at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:109)
> at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy36.registerApplicationMaster(Unknown Source)
> at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator.register(RMCommunicator.java:161)
> ... 14 more
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): Protocol interface org.apache.hadoop.yarn.api.ApplicationMasterProtocolPB is not known.
> at org.apache.hadoop.ipc.Client.call(Client.java:1476)
> at org.apache.hadoop.ipc.Client.call(Client.java:1407)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy35.registerApplicationMaster(Unknown Source)
> at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.registerApplicationMaster(ApplicationMasterProtocolPBClientImpl.java:107)
> ... 21 more
> {quote}
[jira] [Updated] (YARN-8873) Add CSI java-based client library
[ https://issues.apache.org/jira/browse/YARN-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiwei Yang updated YARN-8873:
--
    Attachment: YARN-8873.004.PATCH

> Add CSI java-based client library
> -
>
> Key: YARN-8873
> URL: https://issues.apache.org/jira/browse/YARN-8873
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Major
> Attachments: YARN-8873.001.patch, YARN-8873.002.patch, YARN-8873.003.patch, YARN-8873.004.PATCH
>
>
> Build a java-based client to talk to CSI drivers, through CSI gRPC services.
[jira] [Commented] (YARN-8875) [Submarine] Add documentation for submarine installation script details
[ https://issues.apache.org/jira/browse/YARN-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651181#comment-16651181 ]

Wangda Tan commented on YARN-8875:
--

Thanks [~liuxun323]. More comments:

1) Inside Index.md, you should remove "Installation guide" / "Installation guide Chinese version" and link to HowToInstall.html.

2) Many links use .md; however, since Hadoop generates HTML pages, you should link to .html instead. For example:
{code}
[EN](InstallationGuide.md) # you should use ...html
{code}

3) The image link doesn't work after the docs are generated. You should put the image under resources/images like the other images. Example:
{code}
...
See below screenshot:
![alt text](./images/tensorboard-service.png "Tensorboard service")
...
{code}

4) Test the site. You can use the following commands:
{code}
mvn clean site:site -Preleasedocs; mvn site:stage -DstagingDirectory=/tmp/hadoop-site
{code}
Once it finishes, you can open /tmp/hadoop-site/hadoop-project/index.html in your browser and check the doc. Submarine can be found in the left nav panel.

> [Submarine] Add documentation for submarine installation script details
> ---
>
> Key: YARN-8875
> URL: https://issues.apache.org/jira/browse/YARN-8875
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Xun Liu
> Assignee: Xun Liu
> Priority: Critical
> Attachments: YARN-8875.001.patch, YARN-8875.002.patch, YARN-8875.003.patch, YARN-8875.004.patch
>
>
> YARN-8870: submarine installation guide
[jira] [Updated] (YARN-8879) Kerberos principal is needed when submitting a submarine job
[ https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zac Zhou updated YARN-8879:
---
    Attachment: YARN-8879.002.patch

> Kerberos principal is needed when submitting a submarine job
> 
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Zac Zhou
> Assignee: Zac Zhou
> Priority: Major
> Attachments: YARN-8879.001.patch, YARN-8879.002.patch
>
>
> When I submitted a submarine job like this:
> {code:java}
> ./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
> --env DOCKER_JAVA_HOME=/opt/java \
> --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
> --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
> --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
> --input_path hdfs://mldev/tmp/cifar-10-data \
> --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
> --num_ps 1 \
> --ps_resources memory=4G,vcores=2,gpu=0 \
> --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
> --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
> --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
> --num_workers 2 \
> --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"
> {code}
>
> I got the following error:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos principal or keytab is missing.
> at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> {code}
[jira] [Updated] (YARN-8851) [Umbrella] A new pluggable device plugin framework to ease vendor plugin development
[ https://issues.apache.org/jira/browse/YARN-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhankun Tang updated YARN-8851:
---
    Attachment: YARN-8851-WIP4-trunk.001.patch

> [Umbrella] A new pluggable device plugin framework to ease vendor plugin development
> 
>
> Key: YARN-8851
> URL: https://issues.apache.org/jira/browse/YARN-8851
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: yarn
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Priority: Major
> Attachments: YARN-8851-WIP2-trunk.001.patch, YARN-8851-WIP3-trunk.001.patch, YARN-8851-WIP4-trunk.001.patch,
> [YARN-8851] YARN_New_Device_Plugin_Framework_Design_Proposal-3.pdf,
> [YARN-8851] YARN_New_Device_Plugin_Framework_Design_Proposal.pdf
>
>
> At present, we support GPU/FPGA devices in YARN in a native, tightly coupled way. But it's difficult for a vendor to implement such a device plugin because the developer needs much knowledge of YARN internals. This also burdens the community with maintaining both YARN core and vendor-specific code.
> Here we propose a new device plugin framework to ease vendor device plugin development and provide a more flexible way to integrate with YARN NM.
[jira] [Commented] (YARN-8879) Kerberos principal is needed when submitting a submarine job
[ https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651487#comment-16651487 ] Hadoop QA commented on YARN-8879: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 21s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 57s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 46s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 26s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 15s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core: The patch generated 1 new + 10 unchanged - 0 fixed = 11 total (was 10) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 1s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 13m 25s{color} | {color:green} hadoop-yarn-services-core in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 1m 56s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 69m 18s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 | | JIRA Issue | YARN-8879 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12944077/YARN-8879.002.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux a595ece4ddb1 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 0bf8a11 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/22199/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/22199/testReport/ | | Max. process+thread count | 754 (vs. ulimit of 1) | | modules | C:
[jira] [Commented] (YARN-8879) Kerberos principal is needed when submitting a submarine job
[ https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651242#comment-16651242 ]

Sunil Govindan commented on YARN-8879:
--

[~yuan_zac] With this, we will validate twice that the principal name is not empty. However, what happens when the keytab is missing? Do we need to validate {{kerberosPrincipal.getKeytab()}} as well?

> Kerberos principal is needed when submitting a submarine job
> 
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Zac Zhou
> Assignee: Zac Zhou
> Priority: Major
> Attachments: YARN-8879.001.patch
>
>
> When I submitted a submarine job like this:
> {code:java}
> ./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
> --env DOCKER_JAVA_HOME=/opt/java \
> --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
> --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
> --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
> --input_path hdfs://mldev/tmp/cifar-10-data \
> --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
> --num_ps 1 \
> --ps_resources memory=4G,vcores=2,gpu=0 \
> --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
> --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
> --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
> --num_workers 2 \
> --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"
> {code}
>
> I got the following error:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos principal or keytab is missing.
> at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> {code}
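The question raised in the comment on YARN-8879 (should the keytab be checked alongside the principal name?) could be explored with a check along the following lines. This is only a sketch under stated assumptions: `KerberosSettingsCheckSketch`, `missingSettings` and `validate` are hypothetical names, the inputs are plain strings rather than the YARN service API's Kerberos principal object, and the patch attached to the issue may well take a different approach.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch: checks the principal name and keytab together and
 * reports exactly which one is missing. Not the ServiceApiUtil code.
 */
final class KerberosSettingsCheckSketch {

    /** Returns the names of the settings that are null or blank, in a fixed order. */
    static List<String> missingSettings(String principalName, String keytab) {
        List<String> missing = new ArrayList<>();
        if (isBlank(principalName)) {
            missing.add("principal");
        }
        if (isBlank(keytab)) {
            missing.add("keytab");
        }
        return missing;
    }

    /** Throws when either setting is missing, naming the missing one(s) in the message. */
    static void validate(String principalName, String keytab) {
        List<String> missing = missingSettings(principalName, keytab);
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException(
                "Kerberos " + String.join(" and ", missing) + " missing.");
        }
    }

    private static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }
}
```

Reporting the two settings separately would make it clearer to a submitter which value to supply than the combined "principal or keytab is missing" message in the stack trace above.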
[jira] [Updated] (YARN-8826) Fix lingering timeline collector after serviceStop in TimelineCollectorManager
[ https://issues.apache.org/jira/browse/YARN-8826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prabha Manepalli updated YARN-8826:
---
    Attachment: YARN-8826.v2.patch

> Fix lingering timeline collector after serviceStop in TimelineCollectorManager
> --
>
> Key: YARN-8826
> URL: https://issues.apache.org/jira/browse/YARN-8826
> Project: Hadoop YARN
> Issue Type: Bug
> Components: ATSv2
> Reporter: Prabha Manepalli
> Assignee: Prabha Manepalli
> Priority: Trivial
> Attachments: YARN-8826.v1.patch, YARN-8826.v2.patch
>
>
[jira] [Commented] (YARN-8873) Add CSI java-based client library
[ https://issues.apache.org/jira/browse/YARN-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651273#comment-16651273 ]

Weiwei Yang commented on YARN-8873:
---

The UT failure was not caused by this patch; see YARN-8856 for more information.

> Add CSI java-based client library
> -
>
> Key: YARN-8873
> URL: https://issues.apache.org/jira/browse/YARN-8873
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Major
> Attachments: YARN-8873.001.patch, YARN-8873.002.patch, YARN-8873.003.patch
>
>
> Build a java-based client to talk to CSI drivers, through CSI gRPC services.
[jira] [Updated] (YARN-8873) Add CSI java-based client library
[ https://issues.apache.org/jira/browse/YARN-8873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weiwei Yang updated YARN-8873:
--
    Attachment: YARN-8873.003.patch

> Add CSI java-based client library
> -
>
> Key: YARN-8873
> URL: https://issues.apache.org/jira/browse/YARN-8873
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Major
> Attachments: YARN-8873.001.patch, YARN-8873.002.patch, YARN-8873.003.patch
>
>
> Build a java-based client to talk to CSI drivers, through CSI gRPC services.
[jira] [Commented] (YARN-8879) Kerberos principal is needed when submitting a submarine job
[ https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651286#comment-16651286 ] Hadoop QA commented on YARN-8879: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 25s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 29s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 39s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 18s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core: The patch generated 1 new + 10 unchanged - 0 fixed = 11 total (was 10) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 31s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 17s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 13m 15s{color} | {color:green} hadoop-yarn-services-core in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 69m 54s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 | | JIRA Issue | YARN-8879 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12944062/YARN-8879.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 5e4dac877b45 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 0bf8a11 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/22195/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/22195/testReport/ | | Max. process+thread count | 754 (vs. ulimit of 1) | | modules | C:
[jira] [Created] (YARN-8887) Support isolation in pluggable device framework
Zhankun Tang created YARN-8887: -- Summary: Support isolation in pluggable device framework Key: YARN-8887 URL: https://issues.apache.org/jira/browse/YARN-8887 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhankun Tang Assignee: Zhankun Tang Device isolation needs a complete description in the API specs and a translator in the adapter that converts the requirements into uniform parameters passed to the native container-executor. It should support both the cgroups and Docker runtimes.
[jira] [Updated] (YARN-8851) [Umbrella] A new pluggable device plugin framework to ease vendor plugin development
[ https://issues.apache.org/jira/browse/YARN-8851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang updated YARN-8851: --- Summary: [Umbrella] A new pluggable device plugin framework to ease vendor plugin development (was: [Umbrella] A new device plugin framework to ease vendor plugin development) > [Umbrella] A new pluggable device plugin framework to ease vendor plugin > development > > > Key: YARN-8851 > URL: https://issues.apache.org/jira/browse/YARN-8851 > Project: Hadoop YARN > Issue Type: New Feature > Components: yarn >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Attachments: YARN-8851-WIP2-trunk.001.patch, > YARN-8851-WIP3-trunk.001.patch, [YARN-8851] > YARN_New_Device_Plugin_Framework_Design_Proposal-3.pdf, [YARN-8851] > YARN_New_Device_Plugin_Framework_Design_Proposal.pdf > > > At present, YARN supports GPU/FPGA devices natively, in a tightly coupled > way. This makes it difficult for a vendor to implement a device plugin, > because the developer needs deep knowledge of YARN internals. It also > burdens the community with maintaining both the YARN core and > vendor-specific code. > Here we propose a new device plugin framework to ease vendor device plugin > development and provide a more flexible way to integrate with the YARN NM.
[jira] [Created] (YARN-8883) Provide an example of fake vendor plugin
Zhankun Tang created YARN-8883: -- Summary: Provide an example of fake vendor plugin Key: YARN-8883 URL: https://issues.apache.org/jira/browse/YARN-8883 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhankun Tang Assignee: Zhankun Tang
[jira] [Created] (YARN-8890) Port existing GPU module into pluggable device framework
Zhankun Tang created YARN-8890: -- Summary: Port existing GPU module into pluggable device framework Key: YARN-8890 URL: https://issues.apache.org/jira/browse/YARN-8890 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhankun Tang Assignee: Zhankun Tang Once the pluggable device framework is mature, we can port the existing GPU-related code into it.
[jira] [Updated] (YARN-8880) Add configurations for pluggable plugin framework
[ https://issues.apache.org/jira/browse/YARN-8880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang updated YARN-8880: --- Description: Added two configurations for the pluggable device framework. {code:java} yarn.nodemanager.pluggable-device-framework.enable true/false yarn.nodemanager.resource-plugins.pluggable-classes {code} The admin needs to know the register resource name of every plugin classes configured. And declare them was: Added two configurations for the pluggable device framework. {code:java} yarn.nodemanager.pluggable-device-framework.enable true/false yarn.nodemanager.resource-plugins.pluggable-classes {code} > Add configurations for pluggable plugin framework > - > > Key: YARN-8880 > URL: https://issues.apache.org/jira/browse/YARN-8880 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > > Added two configurations for the pluggable device framework. > {code:java} > > yarn.nodemanager.pluggable-device-framework.enable > true/false > > > yarn.nodemanager.resource-plugins.pluggable-classes > > {code} > The admin needs to know the register resource name of every plugin classes > configured. And declare them -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8880) Add configurations for pluggable plugin framework
[ https://issues.apache.org/jira/browse/YARN-8880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang updated YARN-8880: --- Description: Added two configurations for the pluggable device framework. {code:java} <property> <name>yarn.nodemanager.pluggable-device-framework.enable</name> <value>true/false</value> </property> <property> <name>yarn.nodemanager.resource-plugins.pluggable-classes</name> <value></value> </property> {code} The admin needs to know the registered resource name of every configured plugin class and declare it in resource-types.xml. Please note that the count value defined in node-resources.xml will be overridden by the plugin. was: Added two configurations for the pluggable device framework. {code:java} <property> <name>yarn.nodemanager.pluggable-device-framework.enable</name> <value>true/false</value> </property> <property> <name>yarn.nodemanager.resource-plugins.pluggable-classes</name> <value></value> </property> {code} The admin needs to know the register resource name of every plugin classes configured. And declare them > Add configurations for pluggable plugin framework > - > > Key: YARN-8880 > URL: https://issues.apache.org/jira/browse/YARN-8880 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > > Added two configurations for the pluggable device framework. > {code:java} > <property> > <name>yarn.nodemanager.pluggable-device-framework.enable</name> > <value>true/false</value> > </property> > <property> > <name>yarn.nodemanager.resource-plugins.pluggable-classes</name> > <value></value> > </property> > {code} > The admin needs to know the registered resource name of every configured > plugin class and declare it in resource-types.xml. > Please note that the count value defined in node-resources.xml will be > overridden by the plugin.
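As a rough illustration of how the two settings in this description might be consumed, here is a minimal Java sketch. It uses java.util.Properties as a stand-in for Hadoop's Configuration so that it runs on its own. The class name PluggableDeviceConfigSketch, the comma-separated list format, and the example plugin class names are assumptions made for illustration; they are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical sketch; not the NodeManager implementation. */
final class PluggableDeviceConfigSketch {
  // Property names as given in the YARN-8880 description.
  static final String ENABLE_KEY =
      "yarn.nodemanager.pluggable-device-framework.enable";
  static final String CLASSES_KEY =
      "yarn.nodemanager.resource-plugins.pluggable-classes";

  /** Returns the configured plugin class names, or an empty list if disabled. */
  static List<String> pluginClassNames(Properties conf) {
    List<String> names = new ArrayList<>();
    if (!Boolean.parseBoolean(conf.getProperty(ENABLE_KEY, "false"))) {
      return names; // framework switched off: ignore the class list
    }
    // Assumption: the value is a comma-separated list of class names.
    for (String raw : conf.getProperty(CLASSES_KEY, "").split(",")) {
      String name = raw.trim();
      if (!name.isEmpty()) {
        names.add(name);
      }
    }
    return names;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(ENABLE_KEY, "true");
    // Hypothetical vendor plugin classes.
    conf.setProperty(CLASSES_KEY, "com.example.FakeGpuPlugin, com.example.FakeFpgaPlugin");
    System.out.println(pluginClassNames(conf));
  }
}
```

Returning an empty list when the framework is disabled keeps the caller free of special cases: it can simply iterate over whatever comes back.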
[jira] [Created] (YARN-8891) Documentation of the pluggable device framework
Zhankun Tang created YARN-8891: -- Summary: Documentation of the pluggable device framework Key: YARN-8891 URL: https://issues.apache.org/jira/browse/YARN-8891 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhankun Tang Assignee: Zhankun Tang
[jira] [Updated] (YARN-8879) Kerberos principal is needed when submitting a submarine job
[ https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zac Zhou updated YARN-8879: --- Attachment: (was: YARN-8879.patch) > Kerberos principal is needed when submitting a submarine job > > > Key: YARN-8879 > URL: https://issues.apache.org/jira/browse/YARN-8879 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Major > Attachments: YARN-8879.001.patch > > > when I submitted a submarine job like this: > {code:java} > ./yarn jar > /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar > job run \ > --env DOCKER_JAVA_HOME=/opt/java \ > --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \ > --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \ > --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \ > --input_path hdfs://mldev/tmp/cifar-10-data \ > --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \ > --num_ps 1 \ > --ps_resources memory=4G,vcores=2,gpu=0 \ > --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py > --data-dir=hdfs://mldev/tmp/cifar-10-data > --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \ > --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \ > --worker_resources memory=4G,vcores=2,gpu=1 --verbose \ > --num_workers 2 \ > --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py > --data-dir=hdfs://mldev/tmp/cifar-10-data > --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 > --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" {code} > > The following error as got: > {code:java} > Exception in thread "main" java.lang.IllegalArgumentException: Kerberos > principal or keytab is missing. 
> at > org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255) > at > org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134) > at > org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467) > at > org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542) > at > org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231) > at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:323) > at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8879) Kerberos principal is needed when submitting a submarine job
[ https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zac Zhou updated YARN-8879: --- Attachment: YARN-8879.001.patch > Kerberos principal is needed when submitting a submarine job > > > Key: YARN-8879 > URL: https://issues.apache.org/jira/browse/YARN-8879 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Major > Attachments: YARN-8879.001.patch > > > when I submitted a submarine job like this: > {code:java} > ./yarn jar > /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar > job run \ > --env DOCKER_JAVA_HOME=/opt/java \ > --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \ > --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \ > --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \ > --input_path hdfs://mldev/tmp/cifar-10-data \ > --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \ > --num_ps 1 \ > --ps_resources memory=4G,vcores=2,gpu=0 \ > --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py > --data-dir=hdfs://mldev/tmp/cifar-10-data > --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \ > --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \ > --worker_resources memory=4G,vcores=2,gpu=1 --verbose \ > --num_workers 2 \ > --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py > --data-dir=hdfs://mldev/tmp/cifar-10-data > --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 > --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" {code} > > The following error as got: > {code:java} > Exception in thread "main" java.lang.IllegalArgumentException: Kerberos > principal or keytab is missing. 
> at > org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255) > at > org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134) > at > org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467) > at > org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542) > at > org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231) > at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:323) > at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8881) Add basic pluggable device plugin framework
Zhankun Tang created YARN-8881: -- Summary: Add basic pluggable device plugin framework Key: YARN-8881 URL: https://issues.apache.org/jira/browse/YARN-8881 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhankun Tang Assignee: Zhankun Tang This includes adding support in "ResourcePluginManager" to enable the framework and load plugin classes based on configuration, an interface for the vendor to implement, and an adapter to decouple the plugin from YARN internals.
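The three pieces named in this sub-task (configuration-driven class loading, a vendor-facing interface, and an adapter) could fit together roughly as in the following sketch. All type and method names here (VendorDevicePlugin, FakeFpgaPlugin, DevicePluginAdapterSketch, resourceName, deviceCount) are hypothetical and chosen only for illustration; the real interface is whatever the YARN-8881 patch defines.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical vendor-facing interface: no YARN types leak into it. */
interface VendorDevicePlugin {
  String resourceName();
  int deviceCount();
}

/** Hypothetical fake vendor plugin, used only by this demo. */
final class FakeFpgaPlugin implements VendorDevicePlugin {
  public String resourceName() { return "example.com/fpga"; }
  public int deviceCount() { return 2; }
}

/** Hypothetical adapter: loads plugins by class name and reports their devices. */
final class DevicePluginAdapterSketch {
  static Map<String, Integer> loadAndReport(List<String> classNames) {
    Map<String, Integer> report = new LinkedHashMap<>();
    for (String name : classNames) {
      try {
        // Reflection keeps the framework free of compile-time vendor dependencies.
        Object instance = Class.forName(name).getDeclaredConstructor().newInstance();
        VendorDevicePlugin plugin = (VendorDevicePlugin) instance;
        report.put(plugin.resourceName(), plugin.deviceCount());
      } catch (ReflectiveOperationException e) {
        throw new IllegalStateException("Cannot load plugin class " + name, e);
      }
    }
    return report;
  }

  public static void main(String[] args) {
    List<String> configured = Collections.singletonList(FakeFpgaPlugin.class.getName());
    System.out.println(loadAndReport(configured));
  }
}
```

The point of the adapter layer in this sketch is that the vendor class only implements two small methods, while everything that would touch framework internals stays on the adapter side.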
[jira] [Commented] (YARN-8826) Fix lingering timeline collector after serviceStop in TimelineCollectorManager
[ https://issues.apache.org/jira/browse/YARN-8826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651290#comment-16651290 ] Hadoop QA commented on YARN-8826: - | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 21s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 7s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 36s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 25s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 9s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 52s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 24s{color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 43s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 55m 0s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 | | JIRA Issue | YARN-8826 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12944067/YARN-8826.v2.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 0a8a32731b65 3.13.0-144-generic #193-Ubuntu SMP Thu Mar 15 17:03:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 0bf8a11 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/22196/testReport/ | | Max. process+thread count | 339 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/22196/console | | Powered by | Apache Yetus 0.8.0
[jira] [Commented] (YARN-8879) Kerberos principal is needed when submitting a submarine job
[ https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651307#comment-16651307 ] Zac Zhou commented on YARN-8879: Thanks, [~sunilg], There is logic for kerberosPrincipal.getKeytab() validation already. Make test case validate keytab more clearly in the 002 patch. > Kerberos principal is needed when submitting a submarine job > > > Key: YARN-8879 > URL: https://issues.apache.org/jira/browse/YARN-8879 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Major > Attachments: YARN-8879.001.patch, YARN-8879.002.patch > > > when I submitted a submarine job like this: > {code:java} > ./yarn jar > /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar > job run \ > --env DOCKER_JAVA_HOME=/opt/java \ > --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \ > --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \ > --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \ > --input_path hdfs://mldev/tmp/cifar-10-data \ > --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \ > --num_ps 1 \ > --ps_resources memory=4G,vcores=2,gpu=0 \ > --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py > --data-dir=hdfs://mldev/tmp/cifar-10-data > --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \ > --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \ > --worker_resources memory=4G,vcores=2,gpu=1 --verbose \ > --num_workers 2 \ > --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py > --data-dir=hdfs://mldev/tmp/cifar-10-data > --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 > --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" {code} > > The following error as got: > {code:java} > Exception in thread "main" java.lang.IllegalArgumentException: Kerberos > principal or keytab is missing. 
> at > org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255) > at > org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134) > at > org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467) > at > org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542) > at > org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231) > at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:323) > at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-8885) Support NM APIs to query device resource allocation
Zhankun Tang created YARN-8885: -- Summary: Support NM APIs to query device resource allocation Key: YARN-8885 URL: https://issues.apache.org/jira/browse/YARN-8885 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhankun Tang Assignee: Zhankun Tang Support a REST API in the NM for users to query allocation: *_nodemanager_address:port/ws/v1/node/resources/\{resource_name}_*
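To make the endpoint shape above concrete, the following Java sketch builds such a query URL. The http scheme, the host name nm-host, port 8042, and the resource name yarn.io/gpu are example values, and percent-encoding the resource name is an assumption about how names containing a slash would be passed; none of this is specified in the sub-task itself.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

/** Hypothetical helper; only the path prefix comes from the issue text. */
final class NodeResourceUrlSketch {
  static URI resourceQueryUri(String nmHostPort, String resourceName) {
    try {
      // Assumption: the resource name is percent-encoded so that a "/" inside it
      // does not get treated as a path separator.
      String encoded = URLEncoder.encode(resourceName, "UTF-8");
      return URI.create("http://" + nmHostPort + "/ws/v1/node/resources/" + encoded);
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always available", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(resourceQueryUri("nm-host:8042", "yarn.io/gpu"));
  }
}
```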
[jira] [Updated] (YARN-8877) Extend service spec to allow setting resource attributes
[ https://issues.apache.org/jira/browse/YARN-8877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiwei Yang updated YARN-8877: -- Attachment: YARN-8877.001.patch > Extend service spec to allow setting resource attributes > > > Key: YARN-8877 > URL: https://issues.apache.org/jira/browse/YARN-8877 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Attachments: YARN-8877.001.patch > > > Extend yarn native service spec to support setting resource attributes in the > spec file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized
[ https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651349#comment-16651349 ] Weiwei Yang commented on YARN-8513: --- Hi [~leftnoteasy]/[~cyfdecyf]/[~hustnn]/[~Tao Yang] When this issue was created with the logs attached, it looked very likely that there is a bug in {{CS#allocateContainersToNode}} that keeps it from ever breaking out of the following while loop: {code:java} while (canAllocateMore(...)) {...} {code} I then suggested that [~cyfdecyf] change the property {{yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments}} from *-1* (the default value) to *10*, which seems to work around the issue. I think we should change the default value to 10, because this sort of unbounded greedy lookup is not safe. I am going to submit a patch to change the default value. Thoughts? > CapacityScheduler infinite loop when queue is near fully utilized > - > > Key: YARN-8513 > URL: https://issues.apache.org/jira/browse/YARN-8513 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 3.1.0, 2.9.1 > Environment: Ubuntu 14.04.5 and 16.04.4 > YARN is configured with one label and 5 queues. >Reporter: Chen Yufei >Priority: Major > Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, > jstack-5.log, top-during-lock.log, top-when-normal.log, yarn3-jstack1.log, > yarn3-jstack2.log, yarn3-jstack3.log, yarn3-jstack4.log, yarn3-jstack5.log, > yarn3-resourcemanager.log, yarn3-top > > > ResourceManager does not respond to any request when queue is near fully > utilized sometimes. Sending SIGTERM won't stop RM, only SIGKILL can. After RM > restart, it can recover running jobs and start accepting new ones. 
> > Seems like CapacityScheduler is in an infinite loop printing out the > following log messages (more than 25,000 lines in a second): > > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root usedCapacity=0.99816763 > absoluteUsedCapacity=0.99816763 used= > cluster=}} > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal}} > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1530619767030_1652_01 > container=null > queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943 > clusterResource= type=NODE_LOCAL > requestedPartition=}} > > I encounter this problem several times after upgrading to YARN 2.9.1, while > the same configuration works fine under version 2.7.3. > > YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a > similar problem. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
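The effect of the workaround discussed in the comment above can be sketched in a few lines of Java. This is a deliberately simplified model, not the CapacityScheduler code: the class BoundedAllocationSketch, the shape of canAllocateMore, and the safetyStop parameter (which exists only so the demo terminates) are assumptions for illustration. It models the pathological case where every attempt keeps producing a proposal, so only the per-heartbeat cap can end the loop.

```java
/** Simplified, hypothetical model of a per-heartbeat assignment cap. */
final class BoundedAllocationSketch {
  /** Mirrors the default of -1 mentioned in the comment: no cap. */
  static final int UNLIMITED = -1;

  static boolean canAllocateMore(boolean lastAttemptProposed, int attempts, int maxPerHeartbeat) {
    return lastAttemptProposed
        && (maxPerHeartbeat == UNLIMITED || attempts < maxPerHeartbeat);
  }

  /**
   * Counts allocation attempts during one simulated node heartbeat in which
   * every attempt yields a proposal. safetyStop only bounds the demo itself.
   */
  static int attemptsInOneHeartbeat(int maxPerHeartbeat, int safetyStop) {
    int attempts = 1;        // the first attempt is always made
    boolean proposed = true; // pathological case: always another proposal
    while (canAllocateMore(proposed, attempts, maxPerHeartbeat) && attempts < safetyStop) {
      attempts++;
    }
    return attempts;
  }

  public static void main(String[] args) {
    System.out.println(attemptsInOneHeartbeat(10, 1_000_000));        // stops at the cap
    System.out.println(attemptsInOneHeartbeat(UNLIMITED, 1_000_000)); // stops only at safetyStop
  }
}
```

With a cap of 10 the loop ends after ten attempts regardless of what the allocator keeps proposing; with the unlimited setting, termination depends entirely on the proposals drying up, which is the condition that apparently never became true in the reported logs.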
[jira] [Created] (YARN-8880) Add configurations for pluggable plugin framework
Zhankun Tang created YARN-8880: -- Summary: Add configurations for pluggable plugin framework Key: YARN-8880 URL: https://issues.apache.org/jira/browse/YARN-8880 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhankun Tang Assignee: Zhankun Tang Added two configurations for the pluggable device framework. {code:java} <property> <name>yarn.nodemanager.resource-plugins.pluggable-device-framework.enable</name> <value>true/false</value> </property> <property> <name>yarn.nodemanager.resource-plugins.pluggable-class</name> <value></value> </property> {code}
[jira] [Created] (YARN-8884) Support monitoring of device resource through plugin API
Zhankun Tang created YARN-8884: -- Summary: Support monitoring of device resource through plugin API Key: YARN-8884 URL: https://issues.apache.org/jira/browse/YARN-8884 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhankun Tang Assignee: Zhankun Tang In the current design, the device resource count is reported by the plugin when the NM starts but is never updated afterwards, even when devices break. We should support monitoring and updating the device resource.
[jira] [Created] (YARN-8888) Support device topology scheduling
Zhankun Tang created YARN-:

Summary: Support device topology scheduling
Key: YARN-
URL: https://issues.apache.org/jira/browse/YARN-
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Zhankun Tang
Assignee: Zhankun Tang

The Device spec should provide an easy way for a vendor plugin to describe topology information. The device shared local scheduler will then use this topology information to boost performance.
[jira] [Created] (YARN-8889) Added well-defined interface in container-executor to support vendor plugins isolation request
Zhankun Tang created YARN-8889:

Summary: Added well-defined interface in container-executor to support vendor plugins isolation request
Key: YARN-8889
URL: https://issues.apache.org/jira/browse/YARN-8889
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Zhankun Tang

Because container runtimes differ, the isolation request from a vendor device plugin may be raised before container launch (cgroups operations) or at container launch (Docker runtime). container-executor should provide an easy-to-use interface that supports both cases.
[jira] [Updated] (YARN-8880) Add configurations for pluggable plugin framework
[ https://issues.apache.org/jira/browse/YARN-8880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang updated YARN-8880: --- Description: Added two configurations for the pluggable device framework. {code:java} yarn.nodemanager.pluggable-device-framework.enable true/false yarn.nodemanager.resource-plugins.pluggable-class {code} was: Added two configurations for the pluggable device framework. {code:java} yarn.nodemanager.resource-plugins.pluggable-device-framework.enable true/false yarn.nodemanager.resource-plugins.pluggable-class {code} > Add configurations for pluggable plugin framework > - > > Key: YARN-8880 > URL: https://issues.apache.org/jira/browse/YARN-8880 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > > Added two configurations for the pluggable device framework. > {code:java} > > yarn.nodemanager.pluggable-device-framework.enable > true/false > > > yarn.nodemanager.resource-plugins.pluggable-class > > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8880) Add configurations for pluggable plugin framework
[ https://issues.apache.org/jira/browse/YARN-8880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhankun Tang updated YARN-8880: --- Description: Added two configurations for the pluggable device framework. {code:java} yarn.nodemanager.pluggable-device-framework.enable true/false yarn.nodemanager.resource-plugins.pluggable-classes {code} was: Added two configurations for the pluggable device framework. {code:java} yarn.nodemanager.pluggable-device-framework.enable true/false yarn.nodemanager.resource-plugins.pluggable-class {code} > Add configurations for pluggable plugin framework > - > > Key: YARN-8880 > URL: https://issues.apache.org/jira/browse/YARN-8880 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > > Added two configurations for the pluggable device framework. > {code:java} > > yarn.nodemanager.pluggable-device-framework.enable > true/false > > > yarn.nodemanager.resource-plugins.pluggable-classes > > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
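A minimal sketch of how the two settings named in the updated description might be read. It assumes the `pluggable-classes` value is a comma-separated list of class names and that the framework is off unless explicitly enabled; the ticket states neither. A plain `Map` stands in for Hadoop's configuration object, and `PluggableDeviceConfig` with its method is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PluggableDeviceConfig {
    // Property names as given in the updated YARN-8880 description.
    static final String ENABLE_KEY =
        "yarn.nodemanager.pluggable-device-framework.enable";
    static final String CLASSES_KEY =
        "yarn.nodemanager.resource-plugins.pluggable-classes";

    /** Returns the configured plugin class names, or an empty list when disabled. */
    static List<String> pluginClasses(Map<String, String> conf) {
        List<String> result = new ArrayList<>();
        // Assumption: disabled by default.
        if (!Boolean.parseBoolean(conf.getOrDefault(ENABLE_KEY, "false"))) {
            return result;
        }
        // Assumption: comma-separated class names; blanks are skipped.
        for (String name : conf.getOrDefault(CLASSES_KEY, "").split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(ENABLE_KEY, "true");
        // Placeholder class names, not real plugins.
        conf.put(CLASSES_KEY, "com.example.FakeDevicePlugin, com.example.OtherPlugin");
        System.out.println(pluginClasses(conf));
    }
}
```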
[jira] [Commented] (YARN-8798) [Submarine] Job should not be submitted if "--input_path" option is missing
[ https://issues.apache.org/jira/browse/YARN-8798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652044#comment-16652044 ]

Hadoop QA commented on YARN-8798:

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 22s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 22s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 8s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 16s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine: The patch generated 4 new + 25 unchanged - 0 fixed = 29 total (was 25) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 34s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 41s{color} | {color:green} hadoop-yarn-submarine in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 32s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 55m 25s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8798 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12944163/YARN-8798-trunk.002.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux ba38624d20b6 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 0c2914e |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/22203/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-submarine.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/22203/testReport/ |
| Max. process+thread count | 306 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine U:
[jira] [Updated] (YARN-8798) [Submarine] Job should not be submitted if "--input_path" option is missing
[ https://issues.apache.org/jira/browse/YARN-8798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-8798:

Target Version/s: 3.2.0
Priority: Critical (was: Major)

> [Submarine] Job should not be submitted if "--input_path" option is missing
>
> Key: YARN-8798
> URL: https://issues.apache.org/jira/browse/YARN-8798
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Priority: Critical
> Attachments: YARN-8798-trunk.001.patch, YARN-8798-trunk.002.patch
>
> If a user doesn't set the "--input_path" option, the job will still be submitted.
> Here is my command to run the job:
> {code:java}
> yarn jar
> $HADOOP_BASE_DIR/home/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar
> job run \
> -verbose \
> -wait_job_finish \
> --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-oracle \
> --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.2.0-SNAPSHOT \
> --name tf-job-001 \
> --docker_image tangzhankun/tensorflow \
> --worker_resources memory=4G,vcores=2 \
> --worker_launch_cmd "cd /cifar10_estimator && python cifar10_main.py
> --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0
> --train-steps=5"{code}
> Because there is no validity check, the job is still submitted. We should add a
> check for this.
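The missing check described in this ticket could look roughly like the sketch below. It is not the Submarine implementation or the attached patch: `InputPathCheck` and `requireInputPath` are hypothetical names, and the rule that the option's value must not itself begin with `--` is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class InputPathCheck {
    /**
     * Rejects an argument list that has no "--input_path <value>" pair.
     * Hypothetical helper; the real CLI parsing may differ.
     */
    static void requireInputPath(List<String> args) {
        int idx = args.indexOf("--input_path");
        boolean hasValue = idx >= 0
            && idx + 1 < args.size()
            && !args.get(idx + 1).startsWith("--"); // assumption: a value never starts with "--"
        if (!hasValue) {
            throw new IllegalArgumentException(
                "--input_path is required; refusing to submit the job");
        }
    }

    public static void main(String[] args) {
        List<String> ok = Arrays.asList(
            "job", "run", "--name", "tf-job-001", "--input_path", "hdfs://default/data");
        requireInputPath(ok); // passes silently

        List<String> missing = Arrays.asList("job", "run", "--name", "tf-job-001");
        try {
            requireInputPath(missing);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing before the job specification is built keeps a job with an unresolved `%input_path%` placeholder from ever reaching the cluster.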
[jira] [Updated] (YARN-8892) YARN UI2 doc improvement to update security status
[ https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil Govindan updated YARN-8892: - Attachment: YARN-8892.001.patch > YARN UI2 doc improvement to update security status > -- > > Key: YARN-8892 > URL: https://issues.apache.org/jira/browse/YARN-8892 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-8892.001.patch > > > UI2 is now tested under kerberized env as well. update this in the doc -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8892) YARN UI2 doc improvement to update security status
[ https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil Govindan updated YARN-8892: - Fix Version/s: (was: 3.2.0) > YARN UI2 doc improvement to update security status > -- > > Key: YARN-8892 > URL: https://issues.apache.org/jira/browse/YARN-8892 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-8892.001.patch > > > UI2 is now tested under kerberized env as well. update this in the doc -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8892) YARN UI2 doc improvement to update security status
[ https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil Govindan updated YARN-8892: - Target Version/s: 3.2.0 > YARN UI2 doc improvement to update security status > -- > > Key: YARN-8892 > URL: https://issues.apache.org/jira/browse/YARN-8892 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-8892.001.patch > > > UI2 is now tested under kerberized env as well. update this in the doc -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Botong Huang updated YARN-8862:

Attachment: YARN-8862-YARN-7402.v4.patch

> [GPG] add Yarn Registry cleanup in ApplicationCleaner
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Botong Huang
> Assignee: Botong Huang
> Priority: Major
> Attachments: YARN-8862-YARN-7402.v1.patch,
> YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch,
> YARN-8862-YARN-7402.v4.patch
>
> In Yarn Federation, we use Yarn Registry to store the AMToken for UAMs in
> secondary sub-clusters. Because more app attempts may follow later,
> AMRMProxy cannot kill the UAM and delete the tokens when one local attempt
> finishes. So, as with the StateStore application table, we need
> ApplicationCleaner in GPG to clean up the finished app entries in Yarn
> Registry.
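The cleanup idea in this ticket can be sketched as a set difference: drop every registry entry whose application is no longer known to be running. This is only an illustration. The in-memory `Map` stands in for Yarn Registry, the string values stand in for tokens, and `RegistryCleanerSketch` is a hypothetical name, not the class in the patch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class RegistryCleanerSketch {
    /**
     * Removes registry entries for applications that are not in runningApps.
     * Returns the ids that were removed.
     */
    static Set<String> cleanUp(Map<String, String> registry, Set<String> runningApps) {
        Set<String> finished = new HashSet<>(registry.keySet());
        finished.removeAll(runningApps);
        // Removing through the key-set view also removes the map entries.
        registry.keySet().removeAll(finished);
        return finished;
    }

    public static void main(String[] args) {
        Map<String, String> registry = new HashMap<>();
        registry.put("app_1", "token-a"); // placeholder token values
        registry.put("app_2", "token-b");

        Set<String> running = new HashSet<>();
        running.add("app_2");

        System.out.println("removed: " + cleanUp(registry, running));
        System.out.println("remaining: " + registry.keySet());
    }
}
```

Running the cleaner from a central component, rather than at the end of each attempt, matches the ticket's point that a single finished attempt is not enough evidence that the entry can go.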
[jira] [Comment Edited] (YARN-7086) Release all containers aynchronously
[ https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652065#comment-16652065 ]

Manikandan R edited comment on YARN-7086 at 10/16/18 4:57 PM:

[~jlowe] Reduced I/O by removing unnecessary stdout printing and lowering the log level. With these changes, I ran the test cases again, and the measurements (in ms) for each case do not differ drastically between runs.

In addition to the three cases, since the original intent of this Jira is to release containers asynchronously to avoid potential deadlocks, I added a 4th case that releases each container asynchronously, one at a time, just to understand the difference between multiple container list traversal and handling each container separately.

Based on the results below, the 2nd case (multiple container list traversal) not only reduces performance but also increases the complexity of the code. With the 4th case, the code changes are simple and clean. Although the 4th case takes longer than the 1st and 3rd cases, can we pick the 4th case given that we want to release containers asynchronously? Thoughts?

||Run||Existing code||With Patch (Async release + multiple container list traversal)||With Patch (Not Async release + multiple container list traversal)||With Patch (Async Release for each container separately)||
|1|496|1430|444|1067|
|2|490|1604|453|1401|
|3|427|1133|438|972|
|4|482|1342|429|1228|
|5|459|1106|412|1176|
|Average of 5 runs|470.8|1323|435.2|1168.8|

was (Author: maniraj...@gmail.com):

[~jlowe] Reduced I/O by removing unnecessary stdout printing and lowering the log level. With these changes, I ran the test cases again, and the measurements (in ms) for each case do not differ drastically between runs.

In addition to the three cases, since the original intent of this Jira is to release containers asynchronously, I added a 4th case that releases each container asynchronously, one at a time, just to understand the difference between multiple container list traversal and handling each container separately.

Based on the results below, the 2nd case (multiple container list traversal) not only reduces performance but also increases the complexity of the code. With the 4th case, the code changes are simple and clean. Although the 4th case takes longer than the 1st and 3rd cases, can we pick the 4th case given that we want to release containers asynchronously? Thoughts?

||Run||Existing code||With Patch (Async release + multiple container list traversal)||With Patch (Not Async release + multiple container list traversal)||With Patch (Async Release for each container separately)||
|1|496|1430|444|1067|
|2|490|1604|453|1401|
|3|427|1133|438|972|
|4|482|1342|429|1228|
|5|459|1106|412|1176|
|Average of 5 runs|470.8|1323|435.2|1168.8|

> Release all containers aynchronously
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Arun Suresh
> Assignee: Manikandan R
> Priority: Major
> Attachments: YARN-7086.001.patch, YARN-7086.002.patch,
> YARN-7086.Perf-test-case.patch
>
> We have noticed in production two situations that can cause deadlocks and
> cause scheduling of new containers to come to a halt, especially with regard
> to applications that have a lot of live containers:
> # When these applications release these containers in bulk.
> # When these applications terminate abruptly due to some failure, the
> scheduler releases all its live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make
> sure ALL container releases happen asynchronously - and it has served us well.
> Opening this JIRA to gather feedback on whether this is a good idea generally (cc
> [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, in YARN-6251, we already have an asyncReleaseContainer() in the
> AbstractYarnScheduler and a corresponding scheduler event, which is currently
> used specifically for the container-update code paths (where the scheduler
> releases temp containers which it creates for the update)
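To illustrate the "async release for each container separately" variant discussed in the comment, here is a minimal sketch in which the caller only enqueues one release event per container and a single dispatcher thread does the work. It is a stand-in, not scheduler code: `AsyncReleaseSketch` is hypothetical, a plain executor plays the role of the scheduler event dispatcher, and removing an id from a set plays the role of the actual release.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class AsyncReleaseSketch {
    // One dispatcher thread, so releases are processed in submission order.
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();
    private final Set<String> live = ConcurrentHashMap.newKeySet();

    AsyncReleaseSketch(List<String> containerIds) {
        live.addAll(containerIds);
    }

    /** Enqueues one release event per container and returns without waiting. */
    void releaseAllAsync(List<String> containerIds) {
        for (String id : containerIds) {
            dispatcher.execute(() -> live.remove(id));
        }
    }

    /** Waits for queued releases to finish; returns how many containers are still live. */
    int drain() {
        dispatcher.shutdown();
        try {
            dispatcher.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return live.size();
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("c1", "c2", "c3");
        AsyncReleaseSketch scheduler = new AsyncReleaseSketch(ids);
        scheduler.releaseAllAsync(ids); // caller is not blocked by the release work
        System.out.println("still live after drain: " + scheduler.drain());
    }
}
```

The caller never holds a lock across the whole release loop, which is the property the ticket is after; the price, as the measurements in the comment suggest, is per-event overhead.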
[jira] [Updated] (YARN-8875) [Submarine] Add documentation for submarine installation script details
[ https://issues.apache.org/jira/browse/YARN-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8875: - Attachment: YARN-8875.005.patch > [Submarine] Add documentation for submarine installation script details > --- > > Key: YARN-8875 > URL: https://issues.apache.org/jira/browse/YARN-8875 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xun Liu >Assignee: Xun Liu >Priority: Critical > Attachments: YARN-8875.001.patch, YARN-8875.002.patch, > YARN-8875.003.patch, YARN-8875.004.patch, YARN-8875.005.patch > > > YARN-8870: submarine installation guide -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8875) [Submarine] Add documentation for submarine installation script details
[ https://issues.apache.org/jira/browse/YARN-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652101#comment-16652101 ] Wangda Tan commented on YARN-8875: -- Fixed doc issues I mentioned above. (005) > [Submarine] Add documentation for submarine installation script details > --- > > Key: YARN-8875 > URL: https://issues.apache.org/jira/browse/YARN-8875 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xun Liu >Assignee: Xun Liu >Priority: Critical > Attachments: YARN-8875.001.patch, YARN-8875.002.patch, > YARN-8875.003.patch, YARN-8875.004.patch, YARN-8875.005.patch > > > YARN-8870: submarine installation guide -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8647) Add a flag to disable move app between queues
[ https://issues.apache.org/jira/browse/YARN-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652189#comment-16652189 ] Eric Payne commented on YARN-8647: -- {quote}{quote}Addition flag to disable required {quote} {quote}IMO we shouldn't require another flag to disable it as we are already checking for all the permissions. {quote} we want to disable the feature of move queue on cluster level instead of disabling for few users {quote} [~saruntek], I understand that it would be easier for an admin to just set the flag and forget it. However, I would argue that if a multi-tenant cluster has hundreds of users, it's time to manage the usage more closely by utilizing ACL permissions on each queue. In general, I would prefer to not add code to the scheduler when there is already a pre-designed alternative. > Add a flag to disable move app between queues > - > > Key: YARN-8647 > URL: https://issues.apache.org/jira/browse/YARN-8647 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager >Affects Versions: 2.7.3 >Reporter: sarun singla >Assignee: Abhishek Modi >Priority: Critical > > For large clusters where we have a number of users submitting application, we > can result into scenarios where app developers try to move the queues for > their applications using something like > {code:java} > yarn application -movetoqueue -queue {code} > Today there is no way of disabling the feature if one does not want > application developers to use the feature. > *Solution:* > We should probably add an option to disable move queue feature from RM side > on the cluster level. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
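The alternative argued for in the comment, per-queue ACLs instead of a cluster-wide switch, can be modelled as below. This is a deliberately simplified, hypothetical model (`MoveAclSketch`, `allow`, and `canMove` are illustrative names); it does not reproduce how the ResourceManager evaluates queue ACLs.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class MoveAclSketch {
    // queue name -> users permitted to place applications in that queue
    private final Map<String, Set<String>> submitAcl = new HashMap<>();

    void allow(String queue, String user) {
        submitAcl.computeIfAbsent(queue, q -> new HashSet<>()).add(user);
    }

    /** A move is permitted only if the user holds the ACL on the target queue. */
    boolean canMove(String user, String targetQueue) {
        return submitAcl
            .getOrDefault(targetQueue, Collections.<String>emptySet())
            .contains(user);
    }

    public static void main(String[] args) {
        MoveAclSketch acls = new MoveAclSketch();
        acls.allow("adhoc", "alice");
        System.out.println(acls.canMove("alice", "adhoc"));      // permitted
        System.out.println(acls.canMove("alice", "enterprise")); // denied: no ACL entry
    }
}
```

With this model, denying moves for everyone amounts to granting no user the ACL on the protected queues, so no extra flag is needed.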
[jira] [Updated] (YARN-8481) AMRMProxyPolicies should accept heartbeat response from new/unknown subclusters
[ https://issues.apache.org/jira/browse/YARN-8481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8481: --- Issue Type: Sub-task (was: Bug) Parent: YARN-5597 > AMRMProxyPolicies should accept heartbeat response from new/unknown > subclusters > --- > > Key: YARN-8481 > URL: https://issues.apache.org/jira/browse/YARN-8481 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Minor > Fix For: 2.10.0, 3.2.0, 2.9.2 > > Attachments: YARN-8481.v1.patch > > > Currently BroadcastAMRMProxyPolicy assumes that we only span the application > to the sub-clusters instructed by itself via _splitResourceRequests_. > However, with AMRMProxy HA, second attempts of the application might come up > with multiple sub-clusters initially without consulting the AMRMProxyPolicy > at all. This leads to exceptions in _notifyOfResponse._ It should simply > allow the new/unknown sub-cluster heartbeat responses. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
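The behaviour change described here can be sketched as the difference between a strict and a lenient response handler. The class below is hypothetical and tracks a single integer per sub-cluster as a placeholder; it is not `BroadcastAMRMProxyPolicy`, and the real `notifyOfResponse` has a different signature.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

final class HeartbeatTrackerSketch {
    // sub-cluster id -> last value reported in a heartbeat response (placeholder metric)
    private final Map<String, Integer> lastResponse = new HashMap<>();

    HeartbeatTrackerSketch(Collection<String> knownSubClusters) {
        for (String id : knownSubClusters) {
            lastResponse.put(id, 0);
        }
    }

    /** Old behaviour in spirit: a response from an unknown sub-cluster is an error. */
    void notifyStrict(String subClusterId, int value) {
        if (!lastResponse.containsKey(subClusterId)) {
            throw new IllegalStateException("Unknown sub-cluster: " + subClusterId);
        }
        lastResponse.put(subClusterId, value);
    }

    /** Proposed behaviour in spirit: register the sub-cluster on first sight. */
    void notifyLenient(String subClusterId, int value) {
        lastResponse.put(subClusterId, value);
    }

    boolean knows(String subClusterId) {
        return lastResponse.containsKey(subClusterId);
    }

    public static void main(String[] args) {
        HeartbeatTrackerSketch tracker = new HeartbeatTrackerSketch(Arrays.asList("sc1"));
        tracker.notifyLenient("sc2", 7); // accepted although never requested via the policy
        System.out.println(tracker.knows("sc2"));
    }
}
```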
[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652271#comment-16652271 ]

Wangda Tan edited comment on YARN-8489 at 10/16/18 7:41 PM:

[~eyang],

Basically there are four modes in Submarine for training jobs.

1) A single node notebook runs single node TF training: The user has a single node notebook which can do whatever they want. The TF job runs inside the notebook and is not visible to Submarine.

2) A single node notebook launches distributed TF training: This doesn't exist today, but it could be supported in the future, for example by adding a Submarine interceptor to Zeppelin. However, the notebook service and TF jobs do not belong to the same service, so this statement is not true:
{quote}
It would be bad user experience, if jupyter notebook and all work suddenly disappear when one ps server failed.
{quote}

3) Distributed TF job w/o notebook.

4) Single node TF job w/o notebook.

We will not support a notebook and a distributed TF job running in the service. I haven't heard that an open source community such as Jupyter supports this (connecting to a running distributed TF job and using it as an executor), and I haven't seen TF claim to support it or plan to. Even if the TF/notebook community supports this case, the notebook and executors should belong to two separate services, just like the relationship between Jupyter and Spark.

was (Author: leftnoteasy):

[~eyang],

Basically there are four models in Submarine for training jobs.

1) A single node notebook runs single node TF training: The user has a single node notebook which can do whatever they want. The TF job runs inside the notebook and is not visible to Submarine.

2) A single node notebook launches distributed TF training: This doesn't exist today, but it could be supported in the future, for example by adding a Submarine interceptor to Zeppelin. However, the notebook service and TF jobs do not belong to the same service, so this statement is not true:
{quote}
It would be bad user experience, if jupyter notebook and all work suddenly disappear when one ps server failed.
{quote}

3) Distributed TF job w/o notebook.

4) Single node TF job w/o notebook.

We will not support a notebook and a distributed TF job running in the service. I haven't heard that an open source community such as Jupyter supports this (connecting to a running distributed TF job and using it as an executor), and I haven't seen TF claim to support it or plan to. Even if the TF/notebook community supports this case, the notebook and executors should belong to two separate services, just like the relationship between Jupyter and Spark.

> Need to support "dominant" component concept inside YARN service
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
> Issue Type: Task
> Components: yarn-native-services
> Reporter: Wangda Tan
> Priority: Major
>
> Existing YARN service supports a termination policy for different restart
> policies. For example, ALWAYS means the service will not be terminated, and NEVER
> means that if all components have terminated, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better
> names. Simply put, it means a component whose final state
> determines the job's final state regardless of other components.
> Use cases:
> 1) Tensorflow job has master/worker/services/tensorboard. Once master goes to
> final state, no matter if it is succeeded or failed, we should terminate
> ps/tensorboard/workers and then mark the job as succeeded/failed.
> 2) Not sure if it is a real-world use case: A service which has multiple
> components, some of which are not restartable. For such services, if a
> component fails, we should mark the whole service as failed.
[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652271#comment-16652271 ]

Wangda Tan edited comment on YARN-8489 at 10/16/18 7:42 PM:

[~eyang],

Basically there are four modes in Submarine for training jobs.

1) A single node notebook runs single node TF training: The user has a single node notebook which can do whatever they want. The TF job runs inside the notebook and is not visible to Submarine.

2) A single node notebook launches distributed TF training: This doesn't exist today, but it could be supported in the future, for example by adding a Submarine interceptor to Zeppelin. However, the notebook service and TF jobs do not belong to the same service, so this statement is not true:
{quote}
It would be bad user experience, if jupyter notebook and all work suddenly disappear when one ps server failed.
{quote}

3) Distributed TF job w/o notebook.

4) Single node TF job w/o notebook.

We will not support a notebook and a distributed TF job running in the same service. I haven't heard that an open source community such as Jupyter supports this (connecting to a running distributed TF job and using it as an executor), and I haven't seen TF claim to support it or plan to. Even if the TF/notebook community supports this case next year or so, the notebook and executors should belong to two separate services, just like the relationship between Jupyter and Spark.

was (Author: leftnoteasy):

[~eyang],

Basically there are four modes in Submarine for training jobs.

1) A single node notebook runs single node TF training: The user has a single node notebook which can do whatever they want. The TF job runs inside the notebook and is not visible to Submarine.

2) A single node notebook launches distributed TF training: This doesn't exist today, but it could be supported in the future, for example by adding a Submarine interceptor to Zeppelin. However, the notebook service and TF jobs do not belong to the same service, so this statement is not true:
{quote}
It would be bad user experience, if jupyter notebook and all work suddenly disappear when one ps server failed.
{quote}

3) Distributed TF job w/o notebook.

4) Single node TF job w/o notebook.

We will not support a notebook and a distributed TF job running in the service. I haven't heard that an open source community such as Jupyter supports this (connecting to a running distributed TF job and using it as an executor), and I haven't seen TF claim to support it or plan to. Even if the TF/notebook community supports this case, the notebook and executors should belong to two separate services, just like the relationship between Jupyter and Spark.

> Need to support "dominant" component concept inside YARN service
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
> Issue Type: Task
> Components: yarn-native-services
> Reporter: Wangda Tan
> Priority: Major
>
> Existing YARN service supports a termination policy for different restart
> policies. For example, ALWAYS means the service will not be terminated, and NEVER
> means that if all components have terminated, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better
> names. Simply put, it means a component whose final state
> determines the job's final state regardless of other components.
> Use cases:
> 1) Tensorflow job has master/worker/services/tensorboard. Once master goes to
> final state, no matter if it is succeeded or failed, we should terminate
> ps/tensorboard/workers and then mark the job as succeeded/failed.
> 2) Not sure if it is a real-world use case: A service which has multiple
> components, some of which are not restartable. For such services, if a
> component fails, we should mark the whole service as failed.
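The "dominant" component rule from the issue description can be sketched as follows: the service state mirrors the dominant component's state, and once that state is final, every other component still running is due for termination. All names here (`DominantComponentSketch`, `State`, the two methods) are hypothetical, and the three-value state enum is a simplification of the real service and component states.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DominantComponentSketch {
    enum State { RUNNING, SUCCEEDED, FAILED }

    /** The service's state is whatever the dominant component's state is. */
    static State serviceState(String dominant, Map<String, State> components) {
        // Assumption: a dominant component that has not reported yet counts as RUNNING.
        return components.getOrDefault(dominant, State.RUNNING);
    }

    /** Components to stop once the dominant component has reached a final state. */
    static List<String> toTerminate(String dominant, Map<String, State> components) {
        List<String> result = new ArrayList<>();
        if (serviceState(dominant, components) == State.RUNNING) {
            return result; // nothing to do while the dominant component is alive
        }
        for (Map.Entry<String, State> e : components.entrySet()) {
            if (!e.getKey().equals(dominant) && e.getValue() == State.RUNNING) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, State> components = new LinkedHashMap<>();
        components.put("master", State.SUCCEEDED);
        components.put("worker", State.RUNNING);
        components.put("ps", State.RUNNING);
        components.put("tensorboard", State.RUNNING);

        System.out.println(serviceState("master", components));
        System.out.println(toTerminate("master", components));
    }
}
```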
[jira] [Updated] (YARN-8879) Kerberos principal is needed when submitting a submarine job
[ https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-8879:
-----------------------------
Priority: Critical (was: Major)

> Kerberos principal is needed when submitting a submarine job
> ------------------------------------------------------------
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Zac Zhou
> Assignee: Zac Zhou
> Priority: Critical
> Attachments: YARN-8879.001.patch, YARN-8879.002.patch
>
> When I submitted a submarine job like this:
> {code:java}
> ./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
> --env DOCKER_JAVA_HOME=/opt/java \
> --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
> --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
> --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
> --input_path hdfs://mldev/tmp/cifar-10-data \
> --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
> --num_ps 1 \
> --ps_resources memory=4G,vcores=2,gpu=0 \
> --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
> --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
> --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
> --num_workers 2 \
> --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"
> {code}
>
> I got the following error:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos principal or keytab is missing.
> at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> {code}
[jira] [Updated] (YARN-8879) Kerberos principal is needed when submitting a submarine job
[ https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-8879:
-----------------------------
Target Version/s: 3.2.0

> Kerberos principal is needed when submitting a submarine job
> ------------------------------------------------------------
>
> Key: YARN-8879
> URL: https://issues.apache.org/jira/browse/YARN-8879
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Zac Zhou
> Assignee: Zac Zhou
> Priority: Critical
> Attachments: YARN-8879.001.patch, YARN-8879.002.patch
>
> When I submitted a submarine job like this:
> {code:java}
> ./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
> --env DOCKER_JAVA_HOME=/opt/java \
> --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \
> --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
> --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \
> --input_path hdfs://mldev/tmp/cifar-10-data \
> --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \
> --num_ps 1 \
> --ps_resources memory=4G,vcores=2,gpu=0 \
> --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \
> --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \
> --worker_resources memory=4G,vcores=2,gpu=1 --verbose \
> --num_workers 2 \
> --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://mldev/tmp/cifar-10-data --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"
> {code}
>
> I got the following error:
> {code:java}
> Exception in thread "main" java.lang.IllegalArgumentException: Kerberos principal or keytab is missing.
> at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255)
> at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134)
> at org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467)
> at org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
> at org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
> at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
> {code}
[jira] [Commented] (YARN-8798) [Submarine] Job should not be submitted if "--input_path" option is missing
[ https://issues.apache.org/jira/browse/YARN-8798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652046#comment-16652046 ]

Wangda Tan commented on YARN-8798:
----------------------------------

+1, thanks [~tangzhankun].

[~sunilg], please let me know if you have any concerns about putting it into 3.2.0.

> [Submarine] Job should not be submitted if "--input_path" option is missing
> ---------------------------------------------------------------------------
>
> Key: YARN-8798
> URL: https://issues.apache.org/jira/browse/YARN-8798
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Priority: Critical
> Attachments: YARN-8798-trunk.001.patch, YARN-8798-trunk.002.patch
>
> If a user doesn't set the "--input_path" option, the job will still be submitted.
> Here is my command to run the job:
> {code:java}
> yarn jar $HADOOP_BASE_DIR/home/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
> -verbose \
> -wait_job_finish \
> --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-oracle \
> --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.2.0-SNAPSHOT \
> --name tf-job-001 \
> --docker_image tangzhankun/tensorflow \
> --worker_resources memory=4G,vcores=2 \
> --worker_launch_cmd "cd /cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0 --train-steps=5"
> {code}
> Because there is no validity check, the job is still submitted. We should add a check for this.
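The kind of check the issue asks for could be sketched roughly as follows. This is an illustration only, not the attached patch: the class and method names (`InputPathCheckSketch`, `requireOption`, `validate`) are hypothetical, and the parsed options are modelled as a plain map rather than Submarine's real parameter classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: reject a job before submission when a required option is absent. */
public class InputPathCheckSketch {

  /** Returns the value of a required option, or throws if it is missing or blank. */
  static String requireOption(Map<String, String> options, String name) {
    String value = options.get(name);
    if (value == null || value.trim().isEmpty()) {
      throw new IllegalArgumentException("\"--" + name + "\" is required but was not set");
    }
    return value;
  }

  /** Validates the options up front; a (hypothetical) submit step would only run afterwards. */
  static void validate(Map<String, String> options) {
    requireOption(options, "input_path");
  }

  public static void main(String[] args) {
    Map<String, String> options = new HashMap<>();
    options.put("name", "tf-job-001"); // no input_path, as in the reported command
    try {
      validate(options);
      System.out.println("submitted");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Failing before submission keeps the error on the client side, so no YARN application is created for a job that cannot resolve `%input_path%`.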
[jira] [Commented] (YARN-8798) [Submarine] Job should not be submitted if "--input_path" option is missing
[ https://issues.apache.org/jira/browse/YARN-8798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652060#comment-16652060 ]

Sunil Govindan commented on YARN-8798:
--------------------------------------

Thanks [~leftnoteasy]. I think it's better to put it into 3.2 as well.

> [Submarine] Job should not be submitted if "--input_path" option is missing
> ---------------------------------------------------------------------------
>
> Key: YARN-8798
> URL: https://issues.apache.org/jira/browse/YARN-8798
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Priority: Critical
> Attachments: YARN-8798-trunk.001.patch, YARN-8798-trunk.002.patch
>
> If a user doesn't set the "--input_path" option, the job will still be submitted.
> Here is my command to run the job:
> {code:java}
> yarn jar $HADOOP_BASE_DIR/home/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
> -verbose \
> -wait_job_finish \
> --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-oracle \
> --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.2.0-SNAPSHOT \
> --name tf-job-001 \
> --docker_image tangzhankun/tensorflow \
> --worker_resources memory=4G,vcores=2 \
> --worker_launch_cmd "cd /cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0 --train-steps=5"
> {code}
> Because there is no validity check, the job is still submitted. We should add a check for this.
[jira] [Commented] (YARN-8892) YARN UI2 doc improvement to update security status
[ https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652136#comment-16652136 ]

Hadoop QA commented on YARN-8892:
---------------------------------

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 27s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 34m 1s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 19s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 50m 23s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8892 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12944171/YARN-8892.001.patch |
| Optional Tests | dupname asflicense mvnsite |
| uname | Linux 99daf898a191 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 753f149 |
| maven | version: Apache Maven 3.3.9 |
| Max. process+thread count | 307 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/22204/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.

> YARN UI2 doc improvement to update security status
> --------------------------------------------------
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Sunil Govindan
> Assignee: Sunil Govindan
> Priority: Major
> Attachments: YARN-8892.001.patch
>
> UI2 is now tested under kerberized env as well. update this in the doc
[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652163#comment-16652163 ]

Wangda Tan commented on YARN-8489:
----------------------------------

[~eyang], I have thought about this, but it seems to me that both existing readiness checks are insufficient. In YARN service, a dependency covers launch order as well as readiness, and the dependencies have to form a DAG. In TF, however, master and ps do not depend on each other at launch time, yet once master succeeds or fails we should give the job the same state, and once ps fails we should mark the job as failed as well.

Maybe "dominant" is not the best field to add, but for TF training use cases it seems sufficient. If we want better extensibility, we can add a ServiceControlPlugin to the service master, for which an app master can specify its own implementation. That should be good for people who want to integrate with the service framework.

Suggestions? [~billie.rinaldi], [~gsaha].

> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
> Issue Type: Task
> Components: yarn-native-services
> Reporter: Wangda Tan
> Priority: Major
>
> The existing YARN service supports a termination policy for different restart policies. For example, ALWAYS means the service will not be terminated, and NEVER means that if all components have terminated, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better names. Put simply, it means a dominant component whose final state determines the job's final state regardless of other components.
> Use cases:
> 1) Tensorflow job has master/worker/services/tensorboard. Once master goes to a final state, no matter whether it succeeded or failed, we should terminate ps/tensorboard/workers and then mark the job as succeeded/failed.
> 2) Not sure if it is a real-world use case: a service which has multiple components, some of which are not restartable. For such services, if a component fails, we should mark the whole service as failed.
[jira] [Commented] (YARN-8862) [GPG] add Yarn Registry cleanup in ApplicationCleaner
[ https://issues.apache.org/jira/browse/YARN-8862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652313#comment-16652313 ]

Giovanni Matteo Fumarola commented on YARN-8862:
------------------------------------------------

Thanks [~botong] for the patch.

NIT: GlobalPolicyGenerator#serviceStop() is missing a null-pointer check. Otherwise, it is +1.

> [GPG] add Yarn Registry cleanup in ApplicationCleaner
> -----------------------------------------------------
>
> Key: YARN-8862
> URL: https://issues.apache.org/jira/browse/YARN-8862
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Botong Huang
> Assignee: Botong Huang
> Priority: Major
> Attachments: YARN-8862-YARN-7402.v1.patch, YARN-8862-YARN-7402.v2.patch, YARN-8862-YARN-7402.v3.patch
>
> In Yarn Federation, we use Yarn Registry to use the AMToken for UAMs in secondary sub-clusters. Because more app attempts may follow later, AMRMProxy cannot kill the UAM and delete the tokens when one local attempt finishes. So, similar to the StateStore application table, we need ApplicationCleaner in GPG to clean up the finished app entries in Yarn Registry.
[jira] [Commented] (YARN-8875) [Submarine] Add documentation for submarine installation script details
[ https://issues.apache.org/jira/browse/YARN-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652173#comment-16652173 ]

Hadoop QA commented on YARN-8875:
---------------------------------

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 31m 49s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 50s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 25s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 46m 44s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 |
| JIRA Issue | YARN-8875 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12944177/YARN-8875.005.patch |
| Optional Tests | dupname asflicense mvnsite |
| uname | Linux 86d6bf7e71ed 3.13.0-144-generic #193-Ubuntu SMP Thu Mar 15 17:03:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 753f149 |
| maven | version: Apache Maven 3.3.9 |
| Max. process+thread count | 340 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/22205/console |
| Powered by | Apache Yetus 0.8.0 http://yetus.apache.org |

This message was automatically generated.

> [Submarine] Add documentation for submarine installation script details
> -----------------------------------------------------------------------
>
> Key: YARN-8875
> URL: https://issues.apache.org/jira/browse/YARN-8875
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Xun Liu
> Assignee: Xun Liu
> Priority: Critical
> Attachments: YARN-8875.001.patch, YARN-8875.002.patch, YARN-8875.003.patch, YARN-8875.004.patch, YARN-8875.005.patch
>
> YARN-8870: submarine installation guide
[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652271#comment-16652271 ]

Wangda Tan commented on YARN-8489:
----------------------------------

[~eyang],

Basically there are four models in submarine for training jobs.

1) A single-node notebook runs single-node TF training: the user has a single-node notebook in which they can do whatever they want. The TF job runs inside the notebook and is not visible to submarine.
2) A single-node notebook launches distributed TF training: this doesn't exist today, but it could be supported in the future, for example by adding a submarine interpreter to Zeppelin. However, the notebook service and the TF jobs do not belong to the same service, so this statement is not true: {quote} It would be bad user experience, if jupyter notebook and all work suddenly disappear when one ps server failed. {quote}
3) Distributed TF job w/o notebook.
4) Single node TF job w/o notebook.

We will not support a notebook and a distributed TF job running in the same service. I haven't heard of an open source community such as Jupyter supporting this (connecting to a running distributed TF job and using it as an executor), and I haven't seen TF claim to support it or plan to. Even if the TF/notebook communities supported this case, the notebook and the executors should belong to two separate services, just like the relationship between Jupyter and Spark.

> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
> Issue Type: Task
> Components: yarn-native-services
> Reporter: Wangda Tan
> Priority: Major
>
> The existing YARN service supports a termination policy for different restart policies. For example, ALWAYS means the service will not be terminated, and NEVER means that if all components have terminated, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better names. Put simply, it means a dominant component whose final state determines the job's final state regardless of other components.
> Use cases:
> 1) Tensorflow job has master/worker/services/tensorboard. Once master goes to a final state, no matter whether it succeeded or failed, we should terminate ps/tensorboard/workers and then mark the job as succeeded/failed.
> 2) Not sure if it is a real-world use case: a service which has multiple components, some of which are not restartable. For such services, if a component fails, we should mark the whole service as failed.
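The "dominant" rule in the issue description can be sketched in a few lines. This is only an illustration under assumptions: the `State` enum, the `finalState` helper and the map of component states are hypothetical and do not correspond to the YARN service API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of a "dominant" component deciding a service's final state. */
public class DominantComponentSketch {

  enum State { RUNNING, SUCCEEDED, FAILED }

  /**
   * Returns the service's final state once the dominant component has reached one,
   * regardless of the other components. Returns empty while the dominant component
   * is still running or is unknown.
   */
  static Optional<State> finalState(Map<String, State> components, String dominant) {
    State state = components.get(dominant);
    if (state == null || state == State.RUNNING) {
      return Optional.empty();
    }
    return Optional.of(state);
  }

  public static void main(String[] args) {
    Map<String, State> tfJob = new LinkedHashMap<>();
    tfJob.put("master", State.SUCCEEDED);
    tfJob.put("worker", State.RUNNING);
    tfJob.put("ps", State.RUNNING);
    tfJob.put("tensorboard", State.RUNNING);

    // The master has finished, so the job takes its state; a real implementation
    // would also terminate the remaining components at this point.
    System.out.println(finalState(tfJob, "master"));
  }
}
```

Use case 2 from the description (any non-restartable component failing fails the service) would need a different rule, since it depends on more than one component.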
[jira] [Commented] (YARN-8892) YARN UI2 doc improvement to update security status
[ https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652053#comment-16652053 ]

Sunil Govindan commented on YARN-8892:
--------------------------------------

cc [~leftnoteasy], could you please review?

> YARN UI2 doc improvement to update security status
> --------------------------------------------------
>
> Key: YARN-8892
> URL: https://issues.apache.org/jira/browse/YARN-8892
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Sunil Govindan
> Assignee: Sunil Govindan
> Priority: Major
> Attachments: YARN-8892.001.patch
>
> UI2 is now tested under kerberized env as well. update this in the doc
[jira] [Commented] (YARN-7086) Release all containers aynchronously
[ https://issues.apache.org/jira/browse/YARN-7086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652065#comment-16652065 ]

Manikandan R commented on YARN-7086:
------------------------------------

[~jlowe] I reduced I/O by removing unnecessary stdout printing and lowering the log level. With these changes I ran the test cases again, and the measurements (in ms) no longer differ drastically between runs of the same case. In addition to the three existing cases, since the original intent of this Jira is to release containers asynchronously, I added a 4th case that releases every container asynchronously, one at a time, to understand the difference between traversing multiple container lists and handling each container separately.

Based on the results below, the 2nd case (multiple container list traversal) not only reduces performance but also increases the complexity of the code. With the 4th case, the code changes are simple and clean. Although the 4th case takes longer than the 1st and 3rd cases, can we pick it given that we want to release containers asynchronously? Thoughts?

||Run||Existing code||With Patch (Async release + multiple container list traversal)||With Patch (Not Async release + multiple container list traversal)||With Patch (Async Release for each container separately)||
|1|496|1430|444|1067|
|2|490|1604|453|1401|
|3|427|1133|438|972|
|4|482|1342|429|1228|
|5|459|1106|412|1176|
|Average of 5 runs|470.8|1323|435.2|1168.8|

> Release all containers aynchronously
> ------------------------------------
>
> Key: YARN-7086
> URL: https://issues.apache.org/jira/browse/YARN-7086
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Arun Suresh
> Assignee: Manikandan R
> Priority: Major
> Attachments: YARN-7086.001.patch, YARN-7086.002.patch, YARN-7086.Perf-test-case.patch
>
> We have noticed in production two situations that can cause deadlocks and cause scheduling of new containers to come to a halt, especially with regard to applications that have a lot of live containers:
> # When these applications release these containers in bulk.
> # When these applications terminate abruptly due to some failure, the scheduler releases all its live containers in a loop.
> To handle the issues mentioned above, we have a patch in production to make sure ALL container releases happen asynchronously - and it has served us well.
> Opening this JIRA to gather feedback on whether this is a good idea generally (cc [~leftnoteasy], [~jlowe], [~curino], [~kasha], [~subru], [~roniburd])
> BTW, in YARN-6251, we already have an asyncReleaseContainer() in the AbstractYarnScheduler and a corresponding scheduler event, which is currently used specifically for the container-update code paths (where the scheduler releases temp containers which it creates for the update)
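The 4th case (one asynchronous release per container) can be pictured with the sketch below. It is a simplified illustration, not the patch: `AsyncReleaseSketch`, `releaseAllAsync` and `drain` are hypothetical names, and a single-threaded executor stands in for the scheduler's event dispatcher instead of the real `asyncReleaseContainer()` event path.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: release each container through its own queued event. */
public class AsyncReleaseSketch {

  // Single dispatcher thread, standing in for the scheduler's event-handling thread.
  private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();
  private final List<String> released = new CopyOnWriteArrayList<>();

  /** Enqueues one release event per container and returns without waiting for them. */
  void releaseAllAsync(List<String> containerIds) {
    for (String id : containerIds) {
      // A real scheduler would free the container's resources in this handler.
      dispatcher.execute(() -> released.add(id));
    }
  }

  /** Stops accepting events, waits for queued releases, and returns what was released. */
  List<String> drain() {
    dispatcher.shutdown();
    try {
      dispatcher.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return released;
  }

  public static void main(String[] args) {
    AsyncReleaseSketch scheduler = new AsyncReleaseSketch();
    scheduler.releaseAllAsync(Arrays.asList("container_1", "container_2", "container_3"));
    System.out.println(scheduler.drain());
  }
}
```

The caller returns as soon as the events are queued, which is the property the issue is after; the per-event overhead is consistent with the 4th column being slower than the synchronous cases in the table.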
[jira] [Commented] (YARN-8875) [Submarine] Add documentation for submarine installation script details
[ https://issues.apache.org/jira/browse/YARN-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652121#comment-16652121 ] Sunil Govindan commented on YARN-8875: -- +1. > [Submarine] Add documentation for submarine installation script details > --- > > Key: YARN-8875 > URL: https://issues.apache.org/jira/browse/YARN-8875 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xun Liu >Assignee: Xun Liu >Priority: Critical > Attachments: YARN-8875.001.patch, YARN-8875.002.patch, > YARN-8875.003.patch, YARN-8875.004.patch, YARN-8875.005.patch > > > YARN-8870: submarine installation guide -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8448) AM HTTPS Support
[ https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652181#comment-16652181 ] Robert Kanter commented on YARN-8448: - {{TestCapacityOverTimePolicy}} failure is unrelated. I'm not sure why cetest failed (it doesn't give any details), and it passes on my machine. > AM HTTPS Support > > > Key: YARN-8448 > URL: https://issues.apache.org/jira/browse/YARN-8448 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Major > Attachments: YARN-8448.001.patch, YARN-8448.002.patch, > YARN-8448.003.patch, YARN-8448.004.patch, YARN-8448.005.patch, > YARN-8448.006.patch, YARN-8448.007.patch, YARN-8448.008.patch, > YARN-8448.009.patch, YARN-8448.010.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652248#comment-16652248 ]

Eric Yang commented on YARN-8489:
---------------------------------

[~leftnoteasy] {quote}master and ps are not depends on each other for launch time{quote}

While the statement about launch is correct, it is not true for Tensorflow run time. For master (jupyter notebook) to send any workload to a parameter server, the parameter server must be running. There is an implicit dependency, which can be defined as master depending on ps, to improve usability.

{quote}And once ps failed, we should mark job is failed as well.{quote}

The parameter server is on the critical path, but it is not completely true that when one ps fails we want to abort the service. The running job needs to be terminated, but mapping a Tensorflow task to a YARN container is a problematic design. I am most concerned about this in the submarine implementation of Tensorflow. In particular, people sitting in front of a jupyter notebook can observe that a parameter server has failed, use other parameter servers, and continue to work. It would be bad user experience, if jupyter notebook and all work suddenly disappear when one ps server failed.

It may be nice to have a method to clean up the service when the single critical component has failed. By using yarn app -destroy, this can happen at the time the user is ready to make a change, instead of losing all state right away to keep the system clean. Neither the dominant component logic nor the plugin approach is the right method to address the design problem in the submarine working model: because the AM state machine is currently incomplete, any plugin that overrides the AM state machine seems like pouring gas on flames.

> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
> Issue Type: Task
> Components: yarn-native-services
> Reporter: Wangda Tan
> Priority: Major
>
> The existing YARN service supports a termination policy for different restart policies. For example, ALWAYS means the service will not be terminated, and NEVER means that if all components have terminated, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better names. Put simply, it means a dominant component whose final state determines the job's final state regardless of other components.
> Use cases:
> 1) Tensorflow job has master/worker/services/tensorboard. Once master goes to a final state, no matter whether it succeeded or failed, we should terminate ps/tensorboard/workers and then mark the job as succeeded/failed.
> 2) Not sure if it is a real-world use case: a service which has multiple components, some of which are not restartable. For such services, if a component fails, we should mark the whole service as failed.
[jira] [Updated] (YARN-8868) Set HTTPOnly attribute to Cookie
[ https://issues.apache.org/jira/browse/YARN-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8868: Attachment: YARN-8810.002.patch > Set HTTPOnly attribute to Cookie > > > Key: YARN-8868 > URL: https://issues.apache.org/jira/browse/YARN-8868 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8868.001.patch > > > 1. The program creates a cookie in Dispatcher.java at line 182, 185 and 199, > but fails to set the HttpOnly flag to true. > 2. The program creates a cookie in WebAppProxyServlet.java at line 141 and > 388, but fails to set the HttpOnly flag to true. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8868) Set HTTPOnly attribute to Cookie
[ https://issues.apache.org/jira/browse/YARN-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8868: Attachment: (was: YARN-8810.002.patch) > Set HTTPOnly attribute to Cookie > > > Key: YARN-8868 > URL: https://issues.apache.org/jira/browse/YARN-8868 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8868.001.patch > > > 1. The program creates a cookie in Dispatcher.java at line 182, 185 and 199, > but fails to set the HttpOnly flag to true. > 2. The program creates a cookie in WebAppProxyServlet.java at line 141 and > 388, but fails to set the HttpOnly flag to true. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
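As a rough illustration of what the issue asks for, the sketch below sets the HttpOnly flag right after a cookie is created. It uses the JDK's `java.net.HttpCookie` as a stand-in so that it runs without a servlet container; the code named in the issue presumably uses `javax.servlet.http.Cookie`, which offers an analogous `setHttpOnly(boolean)` method since Servlet 3.0. The class `HttpOnlyCookies` and the cookie name used here are hypothetical.

```java
import java.net.HttpCookie;

// Hypothetical helper; not taken from Dispatcher.java or WebAppProxyServlet.java.
final class HttpOnlyCookies {
  private HttpOnlyCookies() { }

  /** Creates a cookie and marks it HttpOnly so that client-side scripts cannot read it. */
  static HttpCookie create(String name, String value) {
    HttpCookie cookie = new HttpCookie(name, value);
    cookie.setHttpOnly(true);
    return cookie;
  }

  /** Renders a minimal Set-Cookie header value, appending the HttpOnly attribute when set. */
  static String toSetCookieHeader(HttpCookie cookie) {
    StringBuilder sb = new StringBuilder();
    sb.append(cookie.getName()).append('=').append(cookie.getValue());
    if (cookie.isHttpOnly()) {
      sb.append("; HttpOnly");
    }
    return sb.toString();
  }
}
```

Routing every cookie creation through one helper like this is a design choice that makes it harder to forget the flag at individual call sites.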
[jira] [Created] (YARN-8892) YARN UI2 doc improvement to update security status
Sunil Govindan created YARN-8892: Summary: YARN UI2 doc improvement to update security status Key: YARN-8892 URL: https://issues.apache.org/jira/browse/YARN-8892 Project: Hadoop YARN Issue Type: Bug Reporter: Sunil Govindan Assignee: Sunil Govindan Fix For: 3.2.0 UI2 is now tested under kerberized env as well. update this in the doc -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8810) Yarn Service: discrepancy between hashcode and equals of ConfigFile
[ https://issues.apache.org/jira/browse/YARN-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8810: Attachment: YARN-8810.002.patch > Yarn Service: discrepancy between hashcode and equals of ConfigFile > --- > > Key: YARN-8810 > URL: https://issues.apache.org/jira/browse/YARN-8810 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Minor > Attachments: YARN-8810.001.patch, YARN-8810.002.patch > > > The {{ConfigFile}} class {{equals}} method doesn't check the equality of > {{properties}}. The {{hashCode}} does include the {{properties}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
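The discrepancy described above can be illustrated with a simplified stand-in class. This is a sketch, not the actual patch: `ConfigFileSketch` and its fields `type`, `destFile`, and `properties` are assumptions chosen for illustration. The point is that `equals` and `hashCode` are computed over the same set of fields, `properties` included.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Simplified stand-in for ConfigFile; the field names are assumptions.
final class ConfigFileSketch {
  private final String type;
  private final String destFile;
  private final Map<String, String> properties;

  ConfigFileSketch(String type, String destFile, Map<String, String> properties) {
    this.type = type;
    this.destFile = destFile;
    this.properties = new HashMap<>(properties);
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof ConfigFileSketch)) {
      return false;
    }
    ConfigFileSketch other = (ConfigFileSketch) o;
    // Compare every field that hashCode() uses, including properties.
    return Objects.equals(type, other.type)
        && Objects.equals(destFile, other.destFile)
        && Objects.equals(properties, other.properties);
  }

  @Override
  public int hashCode() {
    // Same field set as equals(), so equal objects always hash alike.
    return Objects.hash(type, destFile, properties);
  }
}
```

Keeping the two methods on the same fields matters mostly for hash-based collections, where two objects that differ only in a field ignored by `equals` would otherwise be treated as the same entry.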
[jira] [Commented] (YARN-8879) Kerberos principal is needed when submitting a submarine job
[ https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652036#comment-16652036 ] Wangda Tan commented on YARN-8879: -- +1, thanks [~sunilg], please go ahead and get it committed. > Kerberos principal is needed when submitting a submarine job > > > Key: YARN-8879 > URL: https://issues.apache.org/jira/browse/YARN-8879 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Critical > Attachments: YARN-8879.001.patch, YARN-8879.002.patch > > > when I submitted a submarine job like this: > {code:java} > ./yarn jar > /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar > job run \ > --env DOCKER_JAVA_HOME=/opt/java \ > --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \ > --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \ > --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \ > --input_path hdfs://mldev/tmp/cifar-10-data \ > --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \ > --num_ps 1 \ > --ps_resources memory=4G,vcores=2,gpu=0 \ > --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py > --data-dir=hdfs://mldev/tmp/cifar-10-data > --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \ > --ps_docker_image 10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \ > --worker_resources memory=4G,vcores=2,gpu=1 --verbose \ > --num_workers 2 \ > --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py > --data-dir=hdfs://mldev/tmp/cifar-10-data > --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 > --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" {code} > > The following error as got: > {code:java} > Exception in thread "main" java.lang.IllegalArgumentException: Kerberos > principal or keytab is missing. 
> at > org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255) > at > org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134) > at > org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467) > at > org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542) > at > org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231) > at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:323) > at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8879) Kerberos principal is needed when submitting a submarine job
[ https://issues.apache.org/jira/browse/YARN-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652090#comment-16652090 ] Hudson commented on YARN-8879: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15225 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/15225/]) YARN-8879. Kerberos principal is needed when submitting a submarine job. (sunilg: rev 753f149fd3f5acf9a98cfc780d7899e307c19002) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/utils/ServiceApiUtil.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/test/java/org/apache/hadoop/yarn/service/utils/TestServiceApiUtil.java > Kerberos principal is needed when submitting a submarine job > > > Key: YARN-8879 > URL: https://issues.apache.org/jira/browse/YARN-8879 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Zac Zhou >Assignee: Zac Zhou >Priority: Critical > Fix For: 3.2.0, 3.3.0 > > Attachments: YARN-8879.001.patch, YARN-8879.002.patch > > > when I submitted a submarine job like this: > {code:java} > ./yarn jar > /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar > job run \ > --env DOCKER_JAVA_HOME=/opt/java \ > --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name distributed-tf-gpu \ > --env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \ > --worker_docker_image 10.120.196.232:5000/gpu-cuda9.0-tf1.8.0-with-models-7 \ > --input_path hdfs://mldev/tmp/cifar-10-data \ > --checkpoint_path hdfs://mldev/user/hadoop/tf-distributed-checkpoint \ > --num_ps 1 \ > --ps_resources memory=4G,vcores=2,gpu=0 \ > --ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py > --data-dir=hdfs://mldev/tmp/cifar-10-data > --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --num-gpus=0" \ > --ps_docker_image 
10.120.196.232:5000/dockerfile-cpu-tf1.8.0-with-models \ > --worker_resources memory=4G,vcores=2,gpu=1 --verbose \ > --num_workers 2 \ > --worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py > --data-dir=hdfs://mldev/tmp/cifar-10-data > --job-dir=hdfs://mldev/tmp/cifar-10-jobdir --train-steps=500 > --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1" {code} > > The following error as got: > {code:java} > Exception in thread "main" java.lang.IllegalArgumentException: Kerberos > principal or keytab is missing. > at > org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateKerberosPrincipal(ServiceApiUtil.java:255) > at > org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:134) > at > org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:467) > at > org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542) > at > org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231) > at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:323) > at org.apache.hadoop.util.RunJar.main(RunJar.java:236){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8810) Yarn Service: discrepancy between hashcode and equals of ConfigFile
[ https://issues.apache.org/jira/browse/YARN-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652263#comment-16652263 ] Chandni Singh commented on YARN-8810: - Added a unit test and a fix to {{DefaultComponentsFinder}} in patch 2 > Yarn Service: discrepancy between hashcode and equals of ConfigFile > --- > > Key: YARN-8810 > URL: https://issues.apache.org/jira/browse/YARN-8810 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Minor > Attachments: YARN-8810.001.patch, YARN-8810.002.patch > > > The {{ConfigFile}} class {{equals}} method doesn't check the equality of > {{properties}}. The {{hashCode}} does include the {{properties}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8449) RM HA for AM HTTPS Support
[ https://issues.apache.org/jira/browse/YARN-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-8449: Attachment: YARN-8449.001.patch > RM HA for AM HTTPS Support > -- > > Key: YARN-8449 > URL: https://issues.apache.org/jira/browse/YARN-8449 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Major > Attachments: YARN-8449.001.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8870) [Submarine] Add submarine installation scripts
[ https://issues.apache.org/jira/browse/YARN-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652469#comment-16652469 ] Hudson commented on YARN-8870: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15231 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/15231/]) YARN-8870. [Submarine] Add submarine installation scripts. (Xun Liu via (wangda: rev 46d6e0016610ced51a76189daeb3ad0e3dbbf94c) * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/nvidia.sh * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/docker/docker.service * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/nvidia-docker.sh * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/submarine/submarine.sh * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/docker/daemon.json * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/install.sh * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/calico/calico-node.service * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/menu.sh * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/install.conf * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/docker.sh * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/etcd/etcd.service * (edit) hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/utils.sh * (add) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/hadoop/container-executor.cfg * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/hadoop.sh * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/download-server.sh * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/etcd.sh * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/calico.sh * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/environment.sh * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/package/calico/calicoctl.cfg * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/installation/scripts/submarine.sh > [Submarine] Add submarine installation scripts > -- > > Key: YARN-8870 > URL: https://issues.apache.org/jira/browse/YARN-8870 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xun Liu >Assignee: Xun Liu >Priority: Critical > Attachments: YARN-8870.001.patch, YARN-8870.004.patch, > YARN-8870.005.patch, YARN-8870.006.patch, YARN-8870.007.patch > > > In order to reduce the deployment difficulty of Hadoop > {Submarine} DNS, Docker, GPU, Network, graphics card, operating system kernel > modification and other components, I specially developed this installation > script to deploy Hadoop \{Submarine} > runtime environment, providing one-click installation Scripts, which can also > be used to install, uninstall, start, and stop individual components step by > step. 
> > Design document: > [https://docs.google.com/document/d/1muCTGFuUXUvM4JaDYjKqX5liQEg-AsNgkxfLMIFxYHU/edit?usp=sharing]
[jira] [Commented] (YARN-8875) [Submarine] Add documentation for submarine installation script details
[ https://issues.apache.org/jira/browse/YARN-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652468#comment-16652468 ] Hudson commented on YARN-8875: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15231 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/15231/]) YARN-8875. [Submarine] Add documentation for submarine installation (wangda: rev ed08dd3b0c9cec20373e8ca4e34d6526bd759943) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationGuide.md * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptEN.md * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/resources/images/submarine-installer.gif * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/HowToInstall.md * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/TestAndTroubleshooting.md * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/Index.md * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptCN.md > [Submarine] Add documentation for submarine installation script details > --- > > Key: YARN-8875 > URL: https://issues.apache.org/jira/browse/YARN-8875 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Xun Liu >Assignee: Xun Liu >Priority: Critical > Attachments: YARN-8875.001.patch, YARN-8875.002.patch, > YARN-8875.003.patch, YARN-8875.004.patch, YARN-8875.005.patch > > > YARN-8870: submarine installation guide -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8798) [Submarine] Job should not be submitted if "--input_path" option is missing
[ https://issues.apache.org/jira/browse/YARN-8798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652470#comment-16652470 ]

Hudson commented on YARN-8798:

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15231 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/15231/])
YARN-8798. [Submarine] Job should not be submitted if --input_path (wangda: rev 143d74775b2b62884090fdd88874134b9eab2888)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/main/java/org/apache/hadoop/yarn/submarine/client/cli/param/RunJobParameters.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/test/java/org/apache/hadoop/yarn/submarine/client/cli/TestRunJobCliParsing.java

> [Submarine] Job should not be submitted if "--input_path" option is missing
>
> Key: YARN-8798
> URL: https://issues.apache.org/jira/browse/YARN-8798
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Zhankun Tang
> Assignee: Zhankun Tang
> Priority: Critical
> Attachments: YARN-8798-trunk.001.patch, YARN-8798-trunk.002.patch
>
> If a user doesn't set the "--input_path" option, the job will still be submitted.
> Here is my command to run the job:
> {code:java}
> yarn jar $HADOOP_BASE_DIR/home/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
>  -verbose \
>  -wait_job_finish \
>  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-oracle \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.2.0-SNAPSHOT \
>  --name tf-job-001 \
>  --docker_image tangzhankun/tensorflow \
>  --worker_resources memory=4G,vcores=2 \
>  --worker_launch_cmd "cd /cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0 --train-steps=5"
> {code}
> Because there is no validity check, the job is still submitted. We should add a check for this.
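The kind of check requested in YARN-8798 could look roughly like the sketch below. It is not the committed change to `RunJobParameters`; the class `RunJobArgsSketch`, the method `requireInputPath`, and the error message are all hypothetical. The idea is simply to reject a missing or blank `--input_path` value before anything is submitted.

```java
// Hypothetical validation helper; names and message are assumptions for illustration.
final class RunJobArgsSketch {
  private RunJobArgsSketch() { }

  /**
   * Returns the given input path unchanged, or throws if it is missing or blank,
   * so that the caller never reaches the job submission step without it.
   */
  static String requireInputPath(String inputPath) {
    if (inputPath == null || inputPath.trim().isEmpty()) {
      throw new IllegalArgumentException("\"--input_path\" is missing");
    }
    return inputPath;
  }
}
```

Failing fast during argument parsing keeps the error on the client side, instead of letting a job start with an unresolved `%input_path%` placeholder in its launch command.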
[jira] [Commented] (YARN-8892) YARN UI2 doc changes to update security status (verified under security environment)
[ https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652471#comment-16652471 ] Hudson commented on YARN-8892: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15231 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/15231/]) YARN-8892. YARN UI2 doc changes to update security status (verified (wangda: rev 538250db26ce0b261bb74053348cddfc2d65cf52) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnUI2.md > YARN UI2 doc changes to update security status (verified under security > environment) > > > Key: YARN-8892 > URL: https://issues.apache.org/jira/browse/YARN-8892 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Blocker > Attachments: YARN-8892.001.patch > > > UI2 is now tested under kerberized env as well. update this in the doc -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8892) YARN UI2 doc improvement to update security status
[ https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652405#comment-16652405 ] Wangda Tan commented on YARN-8892: -- +1, committing, thanks [~sunilg]. > YARN UI2 doc improvement to update security status > -- > > Key: YARN-8892 > URL: https://issues.apache.org/jira/browse/YARN-8892 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Major > Attachments: YARN-8892.001.patch > > > UI2 is now tested under kerberized env as well. update this in the doc -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8892) YARN UI2 doc improvement to update security status
[ https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8892: - Priority: Blocker (was: Major) > YARN UI2 doc improvement to update security status > -- > > Key: YARN-8892 > URL: https://issues.apache.org/jira/browse/YARN-8892 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Blocker > Attachments: YARN-8892.001.patch > > > UI2 is now tested under kerberized env as well. update this in the doc -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652420#comment-16652420 ]

Eric Yang commented on YARN-8489:

[~leftnoteasy]
{quote}We will not support notebook and distributed TF job running in the service. I don't hear open source community like jupyter has support of this (connecting to a running distributed TF job and use it as executor). And I didn't see TF claims to support this or plan to support.{quote}
Jupyter notebook is part of the official Docker Tensorflow image, and this is [explained|https://www.tensorflow.org/extend/architecture] in the official [distributed Tensorflow|https://www.tensorflow.org/deploy/distributed] document. Here is an example of how to run distributed tensorflow with Jupyter notebook on YARN service:
{code}
{
  "name": "tensorflow-service",
  "version": "1.0",
  "kerberos_principal" : {
    "principal_name" : "hbase/_h...@example.com",
    "keytab" : "file:///etc/security/keytabs/hbase.service.keytab"
  },
  "components" : [
    {
      "name": "jupyter",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": { "id": "tensorflow/tensorflow:1.10.1", "type": "DOCKER" },
      "resource": { "cpus": 1, "memory": "256" },
      "configuration": {
        "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true" }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "ps",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": { "id": "tensorflow/tensorflow:1.10.1", "type": "DOCKER" },
      "resource": { "cpus": 1, "memory": "256" },
      "launch_command": "python ps.py",
      "configuration": {
        "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false" }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "worker",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": { "id": "tensorflow/tensorflow:1.10.1", "type": "DOCKER" },
      "launch_command": "python worker.py",
      "resource": { "cpus": 1, "memory": "256" },
      "configuration": {
        "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false" }
      },
      "restart_policy": "NEVER"
    }
  ]
}
{code}
ps.py
{code}
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
server.join()
{code}
In the jupyter notebook, the user can write code on the fly:
{code}
with tf.Session("grpc://worker7.example.com:") as sess:
  for _ in range(1):
    sess.run(train_op)
{code}
Isn't this the easiest way to iterate in a notebook without going through ps/worker setup for each iteration? The only thing the user needs to write is worker.py, which is use-case driven. Am I missing something?

> Need to support "dominant" component concept inside YARN service
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
> Issue Type: Task
> Components: yarn-native-services
> Reporter: Wangda Tan
> Priority: Major
>
> The existing YARN service supports a termination policy for different restart policies. For example, ALWAYS means the service will not be terminated, and NEVER means that if all components have terminated, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out better names. Put simply, it means a dominant component whose final state determines the job's final state regardless of other components.
> Use cases:
> 1) A Tensorflow job has master/worker/services/tensorboard. Once master goes to a final state, no matter whether it succeeded or failed, we should terminate ps/tensorboard/workers and then mark the job as succeeded/failed.
> 2) Not sure if it is a real-world use case: a service has multiple components, and some components are not restartable. For such services, if a component has failed, we should mark the whole service as failed.
[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652420#comment-16652420 ]

Eric Yang edited comment on YARN-8489 at 10/16/18 8:49 PM:

[~leftnoteasy]
{quote}We will not support notebook and distributed TF job running in the service. I don't hear open source community like jupyter has support of this (connecting to a running distributed TF job and use it as executor). And I didn't see TF claims to support this or plan to support.{quote}
Jupyter notebook is part of the official Docker Tensorflow image, and the architecture is [explained|https://www.tensorflow.org/extend/architecture] in the official [distributed Tensorflow|https://www.tensorflow.org/deploy/distributed] document. Here is an example of how to run distributed tensorflow with Jupyter notebook on YARN service:
{code}
{
  "name": "tensorflow-service",
  "version": "1.0",
  "kerberos_principal" : {
    "principal_name" : "hbase/_h...@example.com",
    "keytab" : "file:///etc/security/keytabs/hbase.service.keytab"
  },
  "components" : [
    {
      "name": "jupyter",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": { "id": "tensorflow/tensorflow:1.10.1", "type": "DOCKER" },
      "resource": { "cpus": 1, "memory": "256" },
      "configuration": {
        "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true" }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "ps",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": { "id": "tensorflow/tensorflow:1.10.1", "type": "DOCKER" },
      "resource": { "cpus": 1, "memory": "256" },
      "launch_command": "python ps.py",
      "configuration": {
        "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false" }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "worker",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": { "id": "tensorflow/tensorflow:1.10.1", "type": "DOCKER" },
      "launch_command": "python worker.py",
      "resource": { "cpus": 1, "memory": "256" },
      "configuration": {
        "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false" }
      },
      "restart_policy": "NEVER"
    }
  ]
}
{code}
ps.py
{code}
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
server.join()
{code}
In the jupyter notebook, the user can write code on the fly:
{code}
with tf.Session("grpc://worker7.example.com:") as sess:
  for _ in range(1):
    sess.run(train_op)
{code}
Isn't this the easiest way to iterate in a notebook without going through ps/worker setup for each iteration? The only thing the user needs to write is worker.py, which is use-case driven. Am I missing something?

was (Author: eyang):
[~leftnoteasy]
{quote}We will not support notebook and distributed TF job running in the service. I don't hear open source community like jupyter has support of this (connecting to a running distributed TF job and use it as executor). And I didn't see TF claims to support this or plan to support.{quote}
Jupyter notebook is part of the official Docker Tensorflow image, and this is [explained|https://www.tensorflow.org/extend/architecture] in the official [distributed Tensorflow|https://www.tensorflow.org/deploy/distributed] document. Here is an example of how to run distributed tensorflow with Jupyter notebook on YARN service:
{code}
{
  "name": "tensorflow-service",
  "version": "1.0",
  "kerberos_principal" : {
    "principal_name" : "hbase/_h...@example.com",
    "keytab" : "file:///etc/security/keytabs/hbase.service.keytab"
  },
  "components" : [
    {
      "name": "jupyter",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": { "id": "tensorflow/tensorflow:1.10.1", "type": "DOCKER" },
      "resource": { "cpus": 1, "memory": "256" },
      "configuration": {
        "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true" }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "ps",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": { "id": "tensorflow/tensorflow:1.10.1", "type": "DOCKER" },
      "resource": { "cpus": 1, "memory": "256" },
      "launch_command": "python ps.py",
      "configuration": {
        "env": { "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false" }
      },
      "restart_policy": "NEVER"
    },
    {
      "name":
[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
[ https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8893: --- Component/s: federation amrmproxy > [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client > --- > > Key: YARN-8893 > URL: https://issues.apache.org/jira/browse/YARN-8893 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8893.v1.patch > > > Fix thread leak in AMRMClientRelayer and UAM client used by > FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
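The issue does not say where the leaked threads come from, so the following is only a generic sketch of the usual remedy, not the YARN-8893 patch. It assumes a component that owns a thread pool and must stop it when the surrounding pipeline is destroyed. `RelayerSketch` and its methods are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical component that owns worker threads and releases them on close().
final class RelayerSketch implements AutoCloseable {
  private final ExecutorService pool = Executors.newSingleThreadExecutor();

  /** Hands a task to the owned thread pool. */
  void submit(Runnable task) {
    pool.execute(task);
  }

  /** Reports whether all owned threads have finished after a shutdown. */
  boolean isTerminated() {
    return pool.isTerminated();
  }

  /** Stops the pool: waits briefly for queued work, then forces termination. */
  @Override
  public void close() {
    pool.shutdown();
    try {
      if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
        pool.shutdownNow();
      }
    } catch (InterruptedException e) {
      pool.shutdownNow();
      // Preserve the interrupt status for the caller.
      Thread.currentThread().interrupt();
    }
  }
}
```

A caller that tears down a pipeline would invoke `close()` on every such component. Skipping that call leaves the pool's threads alive, which is the general shape of a thread leak.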
[jira] [Commented] (YARN-8842) Update QueueMetrics with custom resource values
[ https://issues.apache.org/jira/browse/YARN-8842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652443#comment-16652443 ] Haibo Chen commented on YARN-8842: -- +1 on the latest patch. I'll fix the one minor indentation checkstyle issue when I commit. > Update QueueMetrics with custom resource values > > > Key: YARN-8842 > URL: https://issues.apache.org/jira/browse/YARN-8842 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8842.001.patch, YARN-8842.002.patch, > YARN-8842.003.patch, YARN-8842.004.patch, YARN-8842.005.patch, > YARN-8842.006.patch, YARN-8842.007.patch, YARN-8842.008.patch, > YARN-8842.009.patch, YARN-8842.010.patch, YARN-8842.011.patch, > YARN-8842.012.patch > > > This is the 2nd dependent jira of YARN-8059. > As updating the metrics is an independent step from handling preemption, this > jira only deals with the queue metrics update of custom resources. > The following metrics should be updated: > * allocated resources > * available resources > * pending resources > * reserved resources > * aggregate seconds preempted
[jira] [Updated] (YARN-8892) YARN UI2 doc changes to update security status (verified under security environment)
[ https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8892: - Target Version/s: 3.2.0, 3.1.2 (was: 3.2.0) > YARN UI2 doc changes to update security status (verified under security > environment) > > > Key: YARN-8892 > URL: https://issues.apache.org/jira/browse/YARN-8892 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Blocker > Attachments: YARN-8892.001.patch > > > UI2 is now tested under kerberized env as well. update this in the doc -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8868) Set HTTPOnly attribute to Cookie
[ https://issues.apache.org/jira/browse/YARN-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chandni Singh updated YARN-8868: Attachment: old_rm_ui.png new_rm_ui.png > Set HTTPOnly attribute to Cookie > > > Key: YARN-8868 > URL: https://issues.apache.org/jira/browse/YARN-8868 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8868.001.patch, new_rm_ui.png, old_rm_ui.png > > > 1. The program creates a cookie in Dispatcher.java at line 182, 185 and 199, > but fails to set the HttpOnly flag to true. > 2. The program creates a cookie in WebAppProxyServlet.java at line 141 and > 388, but fails to set the HttpOnly flag to true. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8868) Set HTTPOnly attribute to Cookie
[ https://issues.apache.org/jira/browse/YARN-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652651#comment-16652651 ] Chandni Singh commented on YARN-8868: - [~sunilg] I checked the patch in a kerberized cluster. Both the new RM UI and the old RM UI are visible. Screenshots are attached. > Set HTTPOnly attribute to Cookie > > > Key: YARN-8868 > URL: https://issues.apache.org/jira/browse/YARN-8868 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Attachments: YARN-8868.001.patch > > > 1. The program creates a cookie in Dispatcher.java at line 182, 185 and 199, > but fails to set the HttpOnly flag to true. > 2. The program creates a cookie in WebAppProxyServlet.java at line 141 and > 388, but fails to set the HttpOnly flag to true.
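For readers who have not seen the flag in question, here is a minimal sketch of what "setting HttpOnly" looks like in Java. It is not the YARN-8868 patch. Dispatcher.java and WebAppProxyServlet.java use the servlet API's Cookie class, which has its own setHttpOnly(boolean); to stay self-contained with only the JDK, this sketch uses java.net.HttpCookie instead. The class name, method name, and cookie values are hypothetical.

```java
import java.net.HttpCookie;

// Hypothetical helper for illustration only; not taken from the YARN patch.
final class HttpOnlyCookieSketch {

  // Creates a cookie and marks it HttpOnly so that browsers do not expose
  // it to client-side scripts.
  static HttpCookie newHttpOnlyCookie(String name, String value) {
    HttpCookie cookie = new HttpCookie(name, value);
    cookie.setHttpOnly(true);
    return cookie;
  }

  public static void main(String[] args) {
    // The cookie name and value are made up for this example.
    HttpCookie cookie = newHttpOnlyCookie("example-cookie", "example-value");
    System.out.println(cookie.getName() + " httpOnly=" + cookie.isHttpOnly());
  }
}
```

Routing every cookie creation through one helper like this is one way to avoid missing a call site, although the issue description only lists the individual line numbers that need the flag.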
[jira] [Updated] (YARN-8448) AM HTTPS Support for AM communication with RMWeb proxy
[ https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-8448: - Summary: AM HTTPS Support for AM communication with RMWeb proxy (was: AM HTTPS Support) > AM HTTPS Support for AM communication with RMWeb proxy > -- > > Key: YARN-8448 > URL: https://issues.apache.org/jira/browse/YARN-8448 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Major > Attachments: YARN-8448.001.patch, YARN-8448.002.patch, > YARN-8448.003.patch, YARN-8448.004.patch, YARN-8448.005.patch, > YARN-8448.006.patch, YARN-8448.007.patch, YARN-8448.008.patch, > YARN-8448.009.patch, YARN-8448.010.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8448) AM HTTPS Support for AM communication with RMWeb proxy
[ https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652431#comment-16652431 ] Hudson commented on YARN-8448: -- FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #15230 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/15230/]) YARN-8448. AM HTTPS Support for AM communication with RMWeb proxy. (haibochen: rev c2288ac45b748b4119442c46147ccc324926c340) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/test/java/org/apache/hadoop/yarn/server/webproxy/TestProxyCA.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/LinuxContainerRuntimeConstants.java * (edit) hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/security/Credentials.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/DefaultContainerExecutor.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxy.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/TestDockerContainerRuntime.java * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/ProxyCA.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java * (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerRelaunch.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMContext.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DockerLinuxContainerRuntime.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestLinuxContainerExecutorWithMocks.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestDefaultContainerExecutor.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/util.h * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestProxyCAManager.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/TestContainerLaunch.java * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/security/ProxyCAManager.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/ApplicationConstants.java * (edit) 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/linux/runtime/DefaultLinuxContainerRuntime.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-web-proxy/src/main/java/org/apache/hadoop/yarn/server/webproxy/WebAppProxyServlet.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/executor/ContainerStartContext.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c * (add) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/AMSecretKeys.java * (edit) hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/security/ssl/KeyStoreTestUtil.java * (edit)
[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652547#comment-16652547 ] Eric Yang commented on YARN-8489: - [~leftnoteasy] A notebook can communicate with the ps or the workers via gRPC in the same way. The example uses gRPC to access a worker rather than assuming that the notebook is the PS. The PS helps build the tasks that the workers are going to execute more efficiently. The data scientist specifies the cluster spec in the notebook, and the parameter server partitions the models and tasks to increase worker effectiveness. We have digressed from the original goal of this JIRA. My point is that dependency expressions and a refined YARN service state machine can achieve what you are proposing to do with an additional switch. An additional switch may have unforeseen consequences for existing operations. For example, what happens if the dominant component is offline during an upgrade? Should the service terminate and clean up? What about flexing the dominant component down to fewer nodes? In what order should the dominant component and component dependencies be evaluated? How should restart policy be handled in the presence of a dominant component? It would be helpful to draw a state diagram that explains the proposal, to see if this idea is worth pursuing. > Need to support "dominant" component concept inside YARN service > > > Key: YARN-8489 > URL: https://issues.apache.org/jira/browse/YARN-8489 > Project: Hadoop YARN > Issue Type: Task > Components: yarn-native-services >Reporter: Wangda Tan >Priority: Major > > Existing YARN service support termination policy for different restart > policies. For example ALWAYS means service will not be terminated. And NEVER > means if all component terminated, service will be terminated. > The name "dominant" might not be most appropriate , we can figure out better > names. But in simple, it means, a dominant component which final state will > determine job's final state regardless of other components. 
> Use cases: > 1) Tensorflow job has master/worker/services/tensorboard. Once master goes to > final state, no matter if it is succeeded or failed, we should terminate > ps/tensorboard/workers. And the mark the job to succeeded/failed. > 2) Not sure if it is a real-world use case: A service which has multiple > component, some component is not restartable. For such services, if a > component is failed, we should mark the whole service to failed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8778) Add Command Line interface to invoke interactive docker shell
[ https://issues.apache.org/jira/browse/YARN-8778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8778: Attachment: YARN-8778.003.patch > Add Command Line interface to invoke interactive docker shell > - > > Key: YARN-8778 > URL: https://issues.apache.org/jira/browse/YARN-8778 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Zian Chen >Assignee: Eric Yang >Priority: Major > Labels: Docker > Attachments: YARN-8778.001.patch, YARN-8778.002.patch, > YARN-8778.003.patch > > > CLI will be the mandatory interface we are providing for a user to use > interactive docker shell feature. We will need to create a new class > “InteractiveDockerShellCLI” to read command line into the servlet and pass > all the way down to docker executor. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
[ https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8893: --- Issue Type: Sub-task (was: Task) Parent: YARN-5597 > [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client > --- > > Key: YARN-8893 > URL: https://issues.apache.org/jira/browse/YARN-8893 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > > Fix thread leak in AMRMClientRelayer and UAM client used by > FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized
[ https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652645#comment-16652645 ] Wangda Tan commented on YARN-8513: -- Sounds like a plan; setting the default value to 100 may make more sense. Thanks, [~cheersyang]. > CapacityScheduler infinite loop when queue is near fully utilized > - > > Key: YARN-8513 > URL: https://issues.apache.org/jira/browse/YARN-8513 > Project: Hadoop YARN > Issue Type: Bug > Components: capacity scheduler, yarn >Affects Versions: 3.1.0, 2.9.1 > Environment: Ubuntu 14.04.5 and 16.04.4 > YARN is configured with one label and 5 queues. >Reporter: Chen Yufei >Priority: Major > Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, > jstack-5.log, top-during-lock.log, top-when-normal.log, yarn3-jstack1.log, > yarn3-jstack2.log, yarn3-jstack3.log, yarn3-jstack4.log, yarn3-jstack5.log, > yarn3-resourcemanager.log, yarn3-top > > > ResourceManager does not respond to any request when queue is near fully > utilized sometimes. Sending SIGTERM won't stop RM, only SIGKILL can. After RM > restart, it can recover running jobs and start accepting new ones. 
> > Seems like CapacityScheduler is in an infinite loop printing out the > following log messages (more than 25,000 lines in a second): > > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: > assignedContainer queue=root usedCapacity=0.99816763 > absoluteUsedCapacity=0.99816763 used= > cluster=}} > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: > Failed to accept allocation proposal}} > {{2018-07-10 17:16:29,227 INFO > org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator: > assignedContainer application attempt=appattempt_1530619767030_1652_01 > container=null > queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943 > clusterResource= type=NODE_LOCAL > requestedPartition=}} > > I encounter this problem several times after upgrading to YARN 2.9.1, while > the same configuration works fine under version 2.7.3. > > YARN-4477 is an infinite loop bug in FairScheduler, not sure if this is a > similar problem. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8810) Yarn Service: discrepancy between hashcode and equals of ConfigFile
[ https://issues.apache.org/jira/browse/YARN-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652662#comment-16652662 ] Chandni Singh commented on YARN-8810: - [~eyang] Thanks for reviewing and merging the patch. > Yarn Service: discrepancy between hashcode and equals of ConfigFile > --- > > Key: YARN-8810 > URL: https://issues.apache.org/jira/browse/YARN-8810 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Minor > Fix For: 3.2.0, 3.1.2, 3.3.0 > > Attachments: YARN-8810.001.patch, YARN-8810.002.patch > > > The {{ConfigFile}} class {{equals}} method doesn't check the equality of > {{properties}}. The {{hashCode}} does include the {{properties}}
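The discrepancy described in the issue matters because Java expects objects that are equal under equals to produce the same hashCode. When equals ignores a field that hashCode includes, two objects can compare equal yet hash differently. The following is a minimal sketch of a consistent pair, not the actual ConfigFile class or the committed patch. The class name ConfigFileSketch and its two fields are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for illustration; the real ConfigFile class lives in
// hadoop-yarn-services-core and its fields may differ.
final class ConfigFileSketch {
  private final String destFile;
  private final Map<String, String> properties;

  ConfigFileSketch(String destFile, Map<String, String> properties) {
    this.destFile = destFile;
    // Defensive copy so later changes to the caller's map do not leak in.
    this.properties = new HashMap<>(properties);
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (!(o instanceof ConfigFileSketch)) {
      return false;
    }
    ConfigFileSketch other = (ConfigFileSketch) o;
    // Compare every field that hashCode() uses, including properties.
    return Objects.equals(destFile, other.destFile)
        && Objects.equals(properties, other.properties);
  }

  @Override
  public int hashCode() {
    // Uses exactly the fields that equals() compares.
    return Objects.hash(destFile, properties);
  }
}
```

Keeping the two methods driven by the same field list is the simplest way to preserve the contract when fields are added later.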
[jira] [Updated] (YARN-8582) Documentation for AM HTTPS Support
[ https://issues.apache.org/jira/browse/YARN-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Kanter updated YARN-8582: Attachment: YARN-8582.004.patch > Documentation for AM HTTPS Support > -- > > Key: YARN-8582 > URL: https://issues.apache.org/jira/browse/YARN-8582 > Project: Hadoop YARN > Issue Type: Sub-task > Components: docs >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Major > Attachments: YARN-8582.001.patch, YARN-8582.002.patch, > YARN-8582.003.patch, YARN-8582.004.patch > > > Documentation for YARN-6586. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8582) Documentation for AM HTTPS Support
[ https://issues.apache.org/jira/browse/YARN-8582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652681#comment-16652681 ] Robert Kanter commented on YARN-8582: - The 004 patch: - Updates the doc patch with the config property description from the final YARN-8448 patch - Fixes a missed OFF to NONE > Documentation for AM HTTPS Support > -- > > Key: YARN-8582 > URL: https://issues.apache.org/jira/browse/YARN-8582 > Project: Hadoop YARN > Issue Type: Sub-task > Components: docs >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Major > Attachments: YARN-8582.001.patch, YARN-8582.002.patch, > YARN-8582.003.patch, YARN-8582.004.patch > > > Documentation for YARN-6586.
[jira] [Commented] (YARN-8810) Yarn Service: discrepancy between hashcode and equals of ConfigFile
[ https://issues.apache.org/jira/browse/YARN-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652386#comment-16652386 ] Hadoop QA commented on YARN-8810: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 28s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 57s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 25s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 17s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 32s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 17s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 13m 10s{color} | {color:green} hadoop-yarn-services-core in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 24s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 68m 35s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:4b8c2b1 | | JIRA Issue | YARN-8810 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12944194/YARN-8810.002.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 5354df747e22 3.13.0-153-generic #203-Ubuntu SMP Thu Jun 14 08:52:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / d59ca43 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_181 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/22206/testReport/ | | Max. process+thread count | 753 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/22206/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > Yarn Service:
[jira] [Commented] (YARN-8448) AM HTTPS Support
[ https://issues.apache.org/jira/browse/YARN-8448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652392#comment-16652392 ] Haibo Chen commented on YARN-8448: -- I ran the cetest locally and it did not fail for me either. +1 on the latest patch. Will check it in shortly. > AM HTTPS Support > > > Key: YARN-8448 > URL: https://issues.apache.org/jira/browse/YARN-8448 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Major > Attachments: YARN-8448.001.patch, YARN-8448.002.patch, > YARN-8448.003.patch, YARN-8448.004.patch, YARN-8448.005.patch, > YARN-8448.006.patch, YARN-8448.007.patch, YARN-8448.008.patch, > YARN-8448.009.patch, YARN-8448.010.patch
[jira] [Updated] (YARN-8842) Expose metrics for custom resource types in QueueMetrics
[ https://issues.apache.org/jira/browse/YARN-8842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haibo Chen updated YARN-8842: - Summary: Expose metrics for custom resource types in QueueMetrics (was: Update QueueMetrics with custom resource values ) > Expose metrics for custom resource types in QueueMetrics > > > Key: YARN-8842 > URL: https://issues.apache.org/jira/browse/YARN-8842 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-8842.001.patch, YARN-8842.002.patch, > YARN-8842.003.patch, YARN-8842.004.patch, YARN-8842.005.patch, > YARN-8842.006.patch, YARN-8842.007.patch, YARN-8842.008.patch, > YARN-8842.009.patch, YARN-8842.010.patch, YARN-8842.011.patch, > YARN-8842.012.patch > > > This is the 2nd dependent jira of YARN-8059. > As updating the metrics is an independent step from handling preemption, this > jira only deals with the queue metrics update of custom resources. > The following metrics should be updated: > * allocated resources > * available resources > * pending resources > * reserved resources > * aggregate seconds preempted -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
[ https://issues.apache.org/jira/browse/YARN-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Botong Huang updated YARN-8893: --- Attachment: YARN-8893.v1.patch > [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client > --- > > Key: YARN-8893 > URL: https://issues.apache.org/jira/browse/YARN-8893 > Project: Hadoop YARN > Issue Type: Sub-task > Components: amrmproxy, federation >Reporter: Botong Huang >Assignee: Botong Huang >Priority: Major > Attachments: YARN-8893.v1.patch > > > Fix thread leak in AMRMClientRelayer and UAM client used by > FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8449) RM HA for AM HTTPS Support
[ https://issues.apache.org/jira/browse/YARN-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652461#comment-16652461 ] Robert Kanter commented on YARN-8449: - The 001 patch adds the code for storing and retrieving the CA Certificate and Private Key in and from the {{RMStateStore}}. The design doc had mentioned also storing the public key, but that's not necessary because the public key can be obtained from the Certificate. The code mirrors the way existing things interact with the {{RMStateStore}}. Also added/updated tests. > RM HA for AM HTTPS Support > -- > > Key: YARN-8449 > URL: https://issues.apache.org/jira/browse/YARN-8449 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Robert Kanter >Assignee: Robert Kanter >Priority: Major > Attachments: YARN-8449.001.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8810) Yarn Service: discrepancy between hashcode and equals of ConfigFile
[ https://issues.apache.org/jira/browse/YARN-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652631#comment-16652631 ] Hudson commented on YARN-8810: -- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #15234 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/15234/]) YARN-8810. Fixed a YARN service bug in comparing ConfigFile object. (eyang: rev 3bfd214a59a60263aff67850c4d646c64fd76a01) * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/api/records/ConfigFile.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/main/java/org/apache/hadoop/yarn/service/UpgradeComponentsFinder.java * (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core/src/test/java/org/apache/hadoop/yarn/service/TestDefaultUpgradeComponentsFinder.java > Yarn Service: discrepancy between hashcode and equals of ConfigFile > --- > > Key: YARN-8810 > URL: https://issues.apache.org/jira/browse/YARN-8810 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Minor > Fix For: 3.2.0, 3.1.2, 3.3.0 > > Attachments: YARN-8810.001.patch, YARN-8810.002.patch > > > The {{ConfigFile}} class {{equals}} method doesn't check the equality of > {{properties}}. The {{hashCode}} does include the {{properties}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652684#comment-16652684 ] Eric Yang commented on YARN-8489: - {quote}If it is never, dominant field will be ignored. Otherwise dominant field is allowed.{quote} If we go by what you proposed, the behavior will not match what users expect from the dominant field and the restart policy. The earlier comment proposed cleaning up the other components when the dominant component finishes. The dominant component could be a batch job that should not be repeated. Ignoring the field does not sound like the right solution here. Changing a dependent component's state to FAILED to signal the other components to terminate seems like a more intuitive way to address the state transition problem. This ensures that state changes triggered by restart policy or upgrade require no additional logic to safeguard the dominant component. {quote} Component.state: - Transition to SUCCEEDED && component.dominant == true: Set service state to SUCCEEDED. - Transition to FAILED && component.dominant == true. Set service state to FAILED. {quote} It looks like you want the service to report a success or failure state based on the status of the "important" component, instead of requiring every component to report SUCCEEDED for the service state to be SUCCEEDED. A safer approach to enable this logic is a boolean flag at the component level, such as "report_as_service_state":true. This requires no alteration to the state transition logic, only an added check at the end. > Need to support "dominant" component concept inside YARN service > > > Key: YARN-8489 > URL: https://issues.apache.org/jira/browse/YARN-8489 > Project: Hadoop YARN > Issue Type: Task > Components: yarn-native-services >Reporter: Wangda Tan >Priority: Major > > Existing YARN service support termination policy for different restart > policies. For example ALWAYS means service will not be terminated. 
And NEVER > means if all components terminated, service will be terminated. > The name "dominant" might not be the most appropriate; we can figure out better > names. But put simply, it means a dominant component whose final state will > determine the job's final state regardless of other components. > Use cases: > 1) Tensorflow job has master/worker/services/tensorboard. Once master goes to > final state, no matter if it is succeeded or failed, we should terminate > ps/tensorboard/workers. And then mark the job as succeeded/failed. > 2) Not sure if it is a real-world use case: A service which has multiple > components, some of which are not restartable. For such services, if a > component fails, we should mark the whole service as failed.
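The "check at the end" idea from the YARN-8489 comment above can be sketched roughly as follows. This is only an illustration, not code from any patch on the issue: the `Component` class, the `State` enum, and the `service_state` function are hypothetical names, and the fallback rule (service is SUCCEEDED only when every component has SUCCEEDED, otherwise still RUNNING) is a simplifying assumption. Only the `report_as_service_state` flag name comes from the comment.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class State(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


TERMINAL = (State.SUCCEEDED, State.FAILED)


@dataclass
class Component:
    # Hypothetical stand-in for a YARN service component.
    name: str
    state: State = State.RUNNING
    # Flag suggested in the comment; everything else here is assumed.
    report_as_service_state: bool = False


def service_state(components: List[Component]) -> State:
    """Derive a service state without touching per-component transitions."""
    # Extra check: a flagged component in a final state decides the outcome.
    for c in components:
        if c.report_as_service_state and c.state in TERMINAL:
            return c.state
    # Assumed default: every component must report SUCCEEDED.
    if components and all(c.state is State.SUCCEEDED for c in components):
        return State.SUCCEEDED
    return State.RUNNING


if __name__ == "__main__":
    master = Component("master", report_as_service_state=True)
    worker = Component("worker")
    master.state = State.FAILED
    print(service_state([master, worker]).value)
```

In this sketch the flagged master alone determines the result once it reaches a final state, which matches the Tensorflow use case in the description; terminating the remaining components is left out.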
[jira] [Created] (YARN-8893) [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client
Botong Huang created YARN-8893: -- Summary: [AMRMProxy] Fix thread leak in AMRMClientRelayer and UAM client Key: YARN-8893 URL: https://issues.apache.org/jira/browse/YARN-8893 Project: Hadoop YARN Issue Type: Task Reporter: Botong Huang Assignee: Botong Huang Fix thread leak in AMRMClientRelayer and UAM client used by FederationInterceptor, when destroying the interceptor pipeline in AMRMProxy.
[jira] [Updated] (YARN-8892) YARN UI2 doc changes to update security status (verified under security environment)
[ https://issues.apache.org/jira/browse/YARN-8892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-8892: - Summary: YARN UI2 doc changes to update security status (verified under security environment) (was: YARN UI2 doc improvement to update security status) > YARN UI2 doc changes to update security status (verified under security > environment) > > > Key: YARN-8892 > URL: https://issues.apache.org/jira/browse/YARN-8892 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Sunil Govindan >Assignee: Sunil Govindan >Priority: Blocker > Attachments: YARN-8892.001.patch > > > UI2 is now tested under a kerberized env as well. Update this in the doc.