[jira] [Updated] (YARN-9295) [UI2] Fix 'Decomissioned' label typo in Cluster Overview page
[ https://issues.apache.org/jira/browse/YARN-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akhil PB updated YARN-9295: --- Summary: [UI2] Fix 'Decomissioned' label typo in Cluster Overview page (was: Fix 'Decomissioned' label typo in Cluster Overview page) > [UI2] Fix 'Decomissioned' label typo in Cluster Overview page > - > > Key: YARN-9295 > URL: https://issues.apache.org/jira/browse/YARN-9295 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Charan Hebri >Assignee: Charan Hebri >Priority: Trivial > Attachments: Decommissioned-typo.png, YARN-9295.001.patch > > > Change label text from 'Decomissioned' to 'Decommissioned' in Node Managers > section of the Cluster Overview page. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9302) make maxAssign configurable at NM side
[ https://issues.apache.org/jira/browse/YARN-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhaohui Xin updated YARN-9302: -- Description: I think it's more flexible to make maxAssign configurable at NM side. After that, we can assign different amount of containers. (was: I think it's more flexible to make maxAssign configurable at NM side. ) > make maxAssign configurable at NM side > -- > > Key: YARN-9302 > URL: https://issues.apache.org/jira/browse/YARN-9302 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > > I think it's more flexible to make maxAssign configurable at NM side. After > that, we can assign different amount of containers. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9302) make maxAssign configurable at NM side
[ https://issues.apache.org/jira/browse/YARN-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhaohui Xin updated YARN-9302: -- Description: I think it's more flexible to make maxAssign configurable at NM side. (was: I think it's more flexible to config) > make maxAssign configurable at NM side > -- > > Key: YARN-9302 > URL: https://issues.apache.org/jira/browse/YARN-9302 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > > I think it's more flexible to make maxAssign configurable at NM side. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9300) Lazy preemption should trigger an update on queue preemption metrics for CapacityScheduler
[ https://issues.apache.org/jira/browse/YARN-9300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-9300: --- Attachment: YARN-9300.001.patch > Lazy preemption should trigger an update on queue preemption metrics for > CapacityScheduler > -- > > Key: YARN-9300 > URL: https://issues.apache.org/jira/browse/YARN-9300 > Project: Hadoop YARN > Issue Type: Bug > Components: capacityscheduler >Affects Versions: 3.2.2 >Reporter: Tao Yang >Assignee: Tao Yang >Priority: Major > Attachments: YARN-9300.001.patch > > > Currently lazy preemption can't trigger an update on queue preemption metrics > since the update is only called in > CapacityScheduler#completedContainerInternal which is not the only way to be > passed for all container completions. > This issue plans to move this update to LeafQueue#completedContainer to > trigger an update on queue preemption metrics for all container completions > because of preemption. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9302) make maxAssign configurable at NM side
[ https://issues.apache.org/jira/browse/YARN-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhaohui Xin updated YARN-9302: -- Description: I think it's more flexible to config > make maxAssign configurable at NM side > -- > > Key: YARN-9302 > URL: https://issues.apache.org/jira/browse/YARN-9302 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > > I think it's more flexible to config -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Assigned] (YARN-9302) make maxAssign configurable at NM side
[ https://issues.apache.org/jira/browse/YARN-9302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhaohui Xin reassigned YARN-9302: - Assignee: Zhaohui Xin > make maxAssign configurable at NM side > -- > > Key: YARN-9302 > URL: https://issues.apache.org/jira/browse/YARN-9302 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhaohui Xin >Assignee: Zhaohui Xin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9302) make maxAssign configurable at NM side
Zhaohui Xin created YARN-9302: - Summary: make maxAssign configurable at NM side Key: YARN-9302 URL: https://issues.apache.org/jira/browse/YARN-9302 Project: Hadoop YARN Issue Type: Improvement Reporter: Zhaohui Xin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9299) TestTimelineReaderWhitelistAuthorizationFilter ignores Http Errors
[ https://issues.apache.org/jira/browse/YARN-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767889#comment-16767889 ] Prabhu Joseph commented on YARN-9299: - [~rohithsharma] Can you review the patch for this jira - this fixes TestTimelineReaderWhitelistAuthorizationFilter positive test cases to make sure there is no SC_FORBIDDEN thrown from TimelineReaderWhitelistAuthorizationFilter. > TestTimelineReaderWhitelistAuthorizationFilter ignores Http Errors > -- > > Key: YARN-9299 > URL: https://issues.apache.org/jira/browse/YARN-9299 > Project: Hadoop YARN > Issue Type: Test >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9299-001.patch > > > TestTimelineReaderWhitelistAuthorizationFilter positive test cases does not > check if there is any Error in HttpResponse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
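The gap described in this comment can be illustrated with a toy filter. This is only a sketch under stated assumptions: `WhitelistFilterSketch`, `statusFor`, and `assertNotForbidden` are hypothetical names, not code from the patch or from TimelineReaderWhitelistAuthorizationFilter, and the filter is reduced to a user lookup. The point is that a positive test case should look at the response status and fail when it is SC_FORBIDDEN.

```java
import java.util.Set;

// Hypothetical stand-in for a whitelist authorization filter; not Hadoop code.
final class WhitelistFilterSketch {
    static final int SC_OK = 200;
    static final int SC_FORBIDDEN = 403; // same value as HttpServletResponse.SC_FORBIDDEN

    private final Set<String> allowedUsers;

    WhitelistFilterSketch(Set<String> allowedUsers) {
        this.allowedUsers = allowedUsers;
    }

    // Returns the status the filter would set for a request from this user.
    int statusFor(String user) {
        return allowedUsers.contains(user) ? SC_OK : SC_FORBIDDEN;
    }

    // A positive test case should fail if the filter reports SC_FORBIDDEN.
    static void assertNotForbidden(int status) {
        if (status == SC_FORBIDDEN) {
            throw new AssertionError("unexpected SC_FORBIDDEN for an allowed user");
        }
    }
}
```

A positive test that only invokes the filter and never calls something like `assertNotForbidden` would pass even when the filter rejects the request, which is the behavior the issue title describes.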
[jira] [Updated] (YARN-9299) TestTimelineReaderWhitelistAuthorizationFilter ignores Http Errors
[ https://issues.apache.org/jira/browse/YARN-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-9299: Affects Version/s: 3.1.2 > TestTimelineReaderWhitelistAuthorizationFilter ignores Http Errors > -- > > Key: YARN-9299 > URL: https://issues.apache.org/jira/browse/YARN-9299 > Project: Hadoop YARN > Issue Type: Test >Affects Versions: 3.1.2 >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9299-001.patch > > > TestTimelineReaderWhitelistAuthorizationFilter positive test cases does not > check if there is any Error in HttpResponse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Created] (YARN-9301) Too many InvalidStateTransitionException with SLS
Bibin A Chundatt created YARN-9301: -- Summary: Too many InvalidStateTransitionException with SLS Key: YARN-9301 URL: https://issues.apache.org/jira/browse/YARN-9301 Project: Hadoop YARN Issue Type: Bug Reporter: Bibin A Chundatt
Too many InvalidStateTransitionException
{noformat}
19/02/13 17:44:43 ERROR rmcontainer.RMContainerImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: LAUNCHED at RUNNING
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
    at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:483)
    at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:65)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.containerLaunchedOnNode(SchedulerApplicationAttempt.java:655)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.containerLaunchedOnNode(AbstractYarnScheduler.java:359)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNewContainerInfo(AbstractYarnScheduler.java:1010)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.nodeUpdate(AbstractYarnScheduler.java:1112)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1295)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1752)
    at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:205)
    at org.apache.hadoop.yarn.sls.scheduler.SLSCapacityScheduler.handle(SLSCapacityScheduler.java:60)
    at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
    at java.lang.Thread.run(Thread.java:745)
19/02/13 17:44:43 ERROR rmcontainer.RMContainerImpl: Invalid event LAUNCHED on container container_1550059705491_0067_01_01
{noformat}
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
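The log above shows a LAUNCHED event arriving for a container that is already RUNNING. As a rough illustration of how a table-driven state machine rejects such an event, here is a miniature sketch. It is not the YARN StateMachineFactory: the class `ContainerStateSketch`, its two states, and the use of IllegalStateException are assumptions made for the example, and it says nothing about why SLS delivers the event twice.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical miniature state machine; it is not the YARN StateMachineFactory.
final class ContainerStateSketch {
    enum State { ACQUIRED, RUNNING }
    enum Event { LAUNCHED }

    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<State, Map<Event, State>>(State.class);
    static {
        Map<Event, State> fromAcquired = new EnumMap<Event, State>(Event.class);
        fromAcquired.put(Event.LAUNCHED, State.RUNNING);
        TRANSITIONS.put(State.ACQUIRED, fromAcquired);
        // No LAUNCHED transition is registered for RUNNING in this sketch.
        TRANSITIONS.put(State.RUNNING, new EnumMap<Event, State>(Event.class));
    }

    private State current = State.ACQUIRED;

    State getState() {
        return current;
    }

    // Applies the event, or throws if no transition is registered for it.
    void handle(Event event) {
        State next = TRANSITIONS.get(current).get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = next;
    }
}
```

In this sketch the first `handle(Event.LAUNCHED)` moves the container to RUNNING and a second one throws, mirroring the shape of the reported error.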
[jira] [Created] (YARN-9300) Lazy preemption should trigger an update on queue preemption metrics for CapacityScheduler
Tao Yang created YARN-9300: -- Summary: Lazy preemption should trigger an update on queue preemption metrics for CapacityScheduler Key: YARN-9300 URL: https://issues.apache.org/jira/browse/YARN-9300 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.2.2 Reporter: Tao Yang Assignee: Tao Yang Currently lazy preemption can't trigger an update on queue preemption metrics since the update is only called in CapacityScheduler#completedContainerInternal which is not the only way to be passed for all container completions. This issue plans to move this update to LeafQueue#completedContainer to trigger an update on queue preemption metrics for all container completions because of preemption. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
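The proposed move can be pictured with a small sketch. All names below (`PreemptionMetricsSketch` and its methods) are hypothetical and only stand in for CapacityScheduler#completedContainerInternal and LeafQueue#completedContainer; the real methods have different signatures. The idea shown is simply that the metrics update sits in the method that every completion path reaches.

```java
// Hypothetical sketch of the proposed move; class and method names are illustrative.
final class PreemptionMetricsSketch {
    enum Reason { FINISHED, PREEMPTED }

    private int preemptedContainers;

    // Common path: every container completion for the queue passes through here,
    // so the metrics update is placed here rather than in a single caller.
    void leafQueueCompletedContainer(Reason reason) {
        if (reason == Reason.PREEMPTED) {
            preemptedContainers++;
        }
    }

    // Path 1: the scheduler-level completion handler delegates to the queue.
    void schedulerCompletedContainer(Reason reason) {
        leafQueueCompletedContainer(reason);
    }

    // Path 2: lazy preemption completes the container without going through path 1.
    void lazyPreemptionCompletedContainer() {
        leafQueueCompletedContainer(Reason.PREEMPTED);
    }

    int getPreemptedContainers() {
        return preemptedContainers;
    }
}
```

Had the counter been incremented only inside `schedulerCompletedContainer`, the lazy-preemption path would leave it unchanged, which is the symptom described in the issue.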
[jira] [Commented] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"
[ https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767790#comment-16767790 ] Eric Yang commented on YARN-8927: - [~ebadger] I think it's still admin mistake because the repository name can be preconfigured to a host in local domain which would have no chance to contact docker hub even if a repository is later setup to try to impersonate. YARN's trusted registry acl can avoid untrusted docker hub repository. The discussion is digressing. I agree that adding the local image white list can tighten security further for images without '/' characters or used. This jira can't solve docker run pulling remote image when image is absent or remote image name is identical to local image name. [~csingh] is solving the docker image localization issues in YARN-9228. It may help to solve precheck of image existence in her story instead. > Support trust top-level image like "centos" when "library" is configured in > "docker.trusted.registries" > --- > > Key: YARN-8927 > URL: https://issues.apache.org/jira/browse/YARN-8927 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Labels: Docker > Attachments: YARN-8927-trunk.001.patch, YARN-8927-trunk.002.patch > > > There are some missing cases that we need to catch when handling > "docker.trusted.registries". > The container-executor.cfg configuration is as follows: > {code:java} > docker.trusted.registries=tangzhankun,ubuntu,centos{code} > It works if run DistrubutedShell with "tangzhankun/tensorflow" > {code:java} > "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env > YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow > {code} > But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" > and "ubuntu[:tagName]" fails: > The error message is like: > {code:java} > "image: centos is not trusted" > {code} > We need better handling the above cases. 
[jira] [Commented] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"
[ https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767744#comment-16767744 ] Eric Badger commented on YARN-8927: --- This isn't an admin mistakenly naming their local image the same as a repository on dockerhub. The admin will name their local images something and then after that a nefarious actor will upload a malicious image to that same location in dockerhub. Unless you are assuming that dockerhub is to be a trusted source, which I don't think it can be. As for avoiding this issue by using a private repository, this is not possible as Docker refuses to remove docker.io from the default registry list (https://github.com/moby/moby/issues/33069). So docker.io will always be the fallback if the image does not exist locally. Again, I would love it if Docker would just allow for you to remove default registries or add a --no-pull flag or similar to the run command. But, since they are not and will not do those, we have to mitigate in other ways to avoid bad apples who can push malicious images to dockerhub. > Support trust top-level image like "centos" when "library" is configured in > "docker.trusted.registries" > --- > > Key: YARN-8927 > URL: https://issues.apache.org/jira/browse/YARN-8927 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Labels: Docker > Attachments: YARN-8927-trunk.001.patch, YARN-8927-trunk.002.patch > > > There are some missing cases that we need to catch when handling > "docker.trusted.registries". > The container-executor.cfg configuration is as follows: > {code:java} > docker.trusted.registries=tangzhankun,ubuntu,centos{code} > It works if run DistrubutedShell with "tangzhankun/tensorflow" > {code:java} > "yarn ... 
-shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env > YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow > {code} > But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" > and "ubuntu[:tagName]" fails: > The error message is like: > {code:java} > "image: centos is not trusted" > {code} > We need better handling the above cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"
[ https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767703#comment-16767703 ] Eric Yang commented on YARN-8927: - [~ebadger] I don't think there is a way to prevent docker run to pull a image that admin has mistakenly named local images that matches repository on docker hub, then having the image absent locally. The chance of this happening is rare and can be avoided by using private repository host/port to avoid contacting docker hub. I like to avoid conflating admin mistakes (usability problem) and actual security problem for this jira to move forward. > Support trust top-level image like "centos" when "library" is configured in > "docker.trusted.registries" > --- > > Key: YARN-8927 > URL: https://issues.apache.org/jira/browse/YARN-8927 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Labels: Docker > Attachments: YARN-8927-trunk.001.patch, YARN-8927-trunk.002.patch > > > There are some missing cases that we need to catch when handling > "docker.trusted.registries". > The container-executor.cfg configuration is as follows: > {code:java} > docker.trusted.registries=tangzhankun,ubuntu,centos{code} > It works if run DistrubutedShell with "tangzhankun/tensorflow" > {code:java} > "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env > YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow > {code} > But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" > and "ubuntu[:tagName]" fails: > The error message is like: > {code:java} > "image: centos is not trusted" > {code} > We need better handling the above cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"
[ https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767645#comment-16767645 ] Eric Badger commented on YARN-8927: --- ARN-9184 deals with explicit pulls. However, docker will do an implicit pull during {{docker run}} if the image does not exist locally. YARN-9184 seems to deal with explicitly pulling (or not pulling) images before the container is launched. > Support trust top-level image like "centos" when "library" is configured in > "docker.trusted.registries" > --- > > Key: YARN-8927 > URL: https://issues.apache.org/jira/browse/YARN-8927 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Labels: Docker > Attachments: YARN-8927-trunk.001.patch, YARN-8927-trunk.002.patch > > > There are some missing cases that we need to catch when handling > "docker.trusted.registries". > The container-executor.cfg configuration is as follows: > {code:java} > docker.trusted.registries=tangzhankun,ubuntu,centos{code} > It works if run DistrubutedShell with "tangzhankun/tensorflow" > {code:java} > "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env > YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow > {code} > But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" > and "ubuntu[:tagName]" fails: > The error message is like: > {code:java} > "image: centos is not trusted" > {code} > We need better handling the above cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"
[ https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767645#comment-16767645 ] Eric Badger edited comment on YARN-8927 at 2/13/19 10:25 PM: - YARN-9184 deals with explicit pulls. However, docker will do an implicit pull during {{docker run}} if the image does not exist locally. YARN-9184 seems to deal with explicitly pulling (or not pulling) images before the container is launched. was (Author: ebadger): ARN-9184 deals with explicit pulls. However, docker will do an implicit pull during {{docker run}} if the image does not exist locally. YARN-9184 seems to deal with explicitly pulling (or not pulling) images before the container is launched. > Support trust top-level image like "centos" when "library" is configured in > "docker.trusted.registries" > --- > > Key: YARN-8927 > URL: https://issues.apache.org/jira/browse/YARN-8927 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Labels: Docker > Attachments: YARN-8927-trunk.001.patch, YARN-8927-trunk.002.patch > > > There are some missing cases that we need to catch when handling > "docker.trusted.registries". > The container-executor.cfg configuration is as follows: > {code:java} > docker.trusted.registries=tangzhankun,ubuntu,centos{code} > It works if run DistrubutedShell with "tangzhankun/tensorflow" > {code:java} > "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env > YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow > {code} > But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" > and "ubuntu[:tagName]" fails: > The error message is like: > {code:java} > "image: centos is not trusted" > {code} > We need better handling the above cases. 
[jira] [Commented] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"
[ https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767636#comment-16767636 ] Eric Yang commented on YARN-8927: - [~ebadger] {quote}If we are assuming that Dockerhub and any other default registry is untrusted (we should), then the assumption has to be that any image by any name can be published. Let's say I tag a local image as hadoop/myimage:latest on every node in my cluster. We have to assume that there could be a repo within the default registry named hadoop with an image named myimage:latest. This doesn't make my local image hadoop/myimage:latest any less of a local image, but it also means that there is an image in Dockerhub by the same name which will be pulled if, for whatever reason, my local image was deleted, not uploaded yet, etc.{quote} The last point is covered by YARN-9184. Can you confirm? > Support trust top-level image like "centos" when "library" is configured in > "docker.trusted.registries" > --- > > Key: YARN-8927 > URL: https://issues.apache.org/jira/browse/YARN-8927 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Labels: Docker > Attachments: YARN-8927-trunk.001.patch, YARN-8927-trunk.002.patch > > > There are some missing cases that we need to catch when handling > "docker.trusted.registries". > The container-executor.cfg configuration is as follows: > {code:java} > docker.trusted.registries=tangzhankun,ubuntu,centos{code} > It works if run DistrubutedShell with "tangzhankun/tensorflow" > {code:java} > "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env > YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow > {code} > But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" > and "ubuntu[:tagName]" fails: > The error message is like: > {code:java} > "image: centos is not trusted" > {code} > We need better handling the above cases. 
[jira] [Commented] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"
[ https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767583#comment-16767583 ] Eric Badger commented on YARN-8927: --- {quote} It seems if a user wants lcoal image "repoA/userA/imageA" to be allowed, he/she should configure "repoA/userA" in the "docker.trusted.registries"? I will try if this works and get back to you. {quote} It's not about wanting repoA/userA/imageA to be allowed. That is an easy problem to solve as you have described. The hard part is allowing repoA/userA/imageA to be allowed _only_ if it exists locally. {quote} And one thing worthing noting is that if YARN allows an image name, then Docker will check if it's local and prefer to run it before pulling from a hub. YARN's checking logic here seems duplicated work because if Docker can pull it and run. We can hardly say this "repoA/userA/imageA" is a real local image. {quote} If we are assuming that Dockerhub and any other default registry is untrusted (we should), then the assumption has to be that any image by any name can be published. Let's say I tag a local image as {{hadoop/myimage:latest}} on every node in my cluster. We have to assume that there could be a repo within the default registry named {{hadoop}} with an image named {{myimage:latest}}. This doesn't make my local image {{hadoop/myimage:latest}} any less of a local image, but it also means that there is an image in Dockerhub by the same name which will be pulled if, for whatever reason, my local image was deleted, not uploaded yet, etc. 
> Support trust top-level image like "centos" when "library" is configured in > "docker.trusted.registries" > --- > > Key: YARN-8927 > URL: https://issues.apache.org/jira/browse/YARN-8927 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Labels: Docker > Attachments: YARN-8927-trunk.001.patch, YARN-8927-trunk.002.patch > > > There are some missing cases that we need to catch when handling > "docker.trusted.registries". > The container-executor.cfg configuration is as follows: > {code:java} > docker.trusted.registries=tangzhankun,ubuntu,centos{code} > It works if run DistrubutedShell with "tangzhankun/tensorflow" > {code:java} > "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env > YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow > {code} > But run a DistrubutedShell job with "centos", "centos[:tagName]", "ubuntu" > and "ubuntu[:tagName]" fails: > The error message is like: > {code:java} > "image: centos is not trusted" > {code} > We need better handling the above cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
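The name-based trust check discussed in this thread could be sketched roughly as follows. This is a minimal Java illustration, not the container-executor implementation: the class `TrustedRegistrySketch` and its methods are hypothetical, and treating an image name without '/' as belonging to "library" is the behavior proposed in the issue title, assumed here rather than confirmed.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names are illustrative and not taken from Hadoop.
final class TrustedRegistrySketch {
    private final Set<String> trusted = new HashSet<>();

    // csv mirrors a docker.trusted.registries value, e.g. "tangzhankun,library".
    TrustedRegistrySketch(String csv) {
        for (String entry : csv.split(",")) {
            String name = entry.trim();
            if (!name.isEmpty()) {
                trusted.add(name);
            }
        }
    }

    // Returns the prefix before the last '/', or "library" for a top-level image.
    static String registryOf(String image) {
        int slash = image.lastIndexOf('/');
        return slash < 0 ? "library" : image.substring(0, slash);
    }

    // True when the image's prefix appears in the trusted list.
    boolean isTrusted(String image) {
        return trusted.contains(registryOf(image));
    }
}
```

A check of this kind looks only at the image name. It cannot tell whether the image exists locally, so it does not address the implicit pull during docker run that is raised in the comment above.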
[jira] [Commented] (YARN-9299) TestTimelineReaderWhitelistAuthorizationFilter ignores Http Errors
[ https://issues.apache.org/jira/browse/YARN-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767473#comment-16767473 ] Hadoop QA commented on YARN-9299: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 25s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 18s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 28s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 11s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 36s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 35s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 44s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 13s{color} | {color:green} hadoop-yarn-server-timelineservice in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 30s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 51m 10s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9299 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12958611/YARN-9299-001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux a17230023736 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 29b411d | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23398/testReport/ | | Max. process+thread count | 336 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-timelineservice | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23398/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > TestTimelineReaderWhitelis
[jira] [Created] (YARN-9299) TestTimelineReaderWhitelistAuthorizationFilter ignores Http Errors
Prabhu Joseph created YARN-9299: --- Summary: TestTimelineReaderWhitelistAuthorizationFilter ignores Http Errors Key: YARN-9299 URL: https://issues.apache.org/jira/browse/YARN-9299 Project: Hadoop YARN Issue Type: Test Reporter: Prabhu Joseph Assignee: Prabhu Joseph TestTimelineReaderWhitelistAuthorizationFilter positive test cases does not check if there is any Error in HttpResponse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-9299) TestTimelineReaderWhitelistAuthorizationFilter ignores Http Errors
[ https://issues.apache.org/jira/browse/YARN-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prabhu Joseph updated YARN-9299: Attachment: YARN-9299-001.patch > TestTimelineReaderWhitelistAuthorizationFilter ignores Http Errors > -- > > Key: YARN-9299 > URL: https://issues.apache.org/jira/browse/YARN-9299 > Project: Hadoop YARN > Issue Type: Test >Reporter: Prabhu Joseph >Assignee: Prabhu Joseph >Priority: Major > Attachments: YARN-9299-001.patch > > > TestTimelineReaderWhitelistAuthorizationFilter positive test cases does not > check if there is any Error in HttpResponse. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9118) Handle issues with parsing user defined GPU devices in GpuDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767366#comment-16767366 ] Hadoop QA commented on YARN-9118: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 40s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 23s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 13s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 54s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 33s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 19s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 4 new + 11 unchanged - 5 fixed = 15 total (was 16) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 9s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 4s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 22s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 20m 52s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 66m 43s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9118 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12958591/YARN-9118.008.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux f51f23e03480 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 00c5ffa | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/23397/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23397/testReport/ | | Max. process+thread count | 447 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanage
[jira] [Commented] (YARN-8927) Support trust top-level image like "centos" when "library" is configured in "docker.trusted.registries"
[ https://issues.apache.org/jira/browse/YARN-8927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767370#comment-16767370 ] Eric Yang commented on YARN-8927: - [~tangzhankun] When "library" is configured and there is a local image named black, that image is not a top-level image, yet it is trusted by default. In [~ebadger]'s environment, local trusted images are tagged like "repoA/imageA". Patch 002 breaks his trust list: top-level images are trusted, but an untagged image named black is trusted as well. This is why he asks for a local image white list, to prevent a local image like black from being trusted. Is this something that can be enhanced in the condition that checks for library and '/'? It would be possible to add a white list there to tighten security. > Support trust top-level image like "centos" when "library" is configured in > "docker.trusted.registries" > --- > > Key: YARN-8927 > URL: https://issues.apache.org/jira/browse/YARN-8927 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Zhankun Tang >Assignee: Zhankun Tang >Priority: Major > Labels: Docker > Attachments: YARN-8927-trunk.001.patch, YARN-8927-trunk.002.patch > > > There are some missing cases that we need to catch when handling > "docker.trusted.registries". > The container-executor.cfg configuration is as follows: > {code:java} > docker.trusted.registries=tangzhankun,ubuntu,centos{code} > It works if we run DistributedShell with "tangzhankun/tensorflow" > {code:java} > "yarn ... -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker -shell_env > YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tangzhankun/tensorflow > {code} > But running a DistributedShell job with "centos", "centos[:tagName]", "ubuntu" > and "ubuntu[:tagName]" fails. > The error message is like: > {code:java} > "image: centos is not trusted" > {code} > We need to handle the above cases better. 
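Editorial sketch for the YARN-8927 discussion above: the trust rule being debated can be illustrated roughly as follows. This is not the container-executor implementation. The class name, the method names, and the idea of a separate local-image white list are hypothetical and only model the proposal in the comment, assuming that an image name without '/' is treated as belonging to the implicit "library" registry.

```java
import java.util.Set;

// Hypothetical model of the trust check discussed in YARN-8927.
final class TrustedImageSketch {
  private final Set<String> trustedRegistries;   // e.g. docker.trusted.registries
  private final Set<String> localImageWhitelist; // assumed extra list, not an existing setting

  TrustedImageSketch(Set<String> trustedRegistries,
                     Set<String> localImageWhitelist) {
    this.trustedRegistries = trustedRegistries;
    this.localImageWhitelist = localImageWhitelist;
  }

  boolean isTrusted(String image) {
    int slash = image.indexOf('/');
    if (slash >= 0) {
      // "repoA/imageA": the part before '/' must be a trusted registry.
      return trustedRegistries.contains(image.substring(0, slash));
    }
    // No '/': either a top-level image such as "centos" or a local image
    // such as "black". Trusting both blindly is the concern raised above,
    // so this sketch also requires the name to be white-listed.
    return trustedRegistries.contains("library")
        && localImageWhitelist.contains(stripTag(image));
  }

  // Removes an optional ":tag" suffix from an image name without '/'.
  static String stripTag(String image) {
    int colon = image.indexOf(':');
    return colon < 0 ? image : image.substring(0, colon);
  }
}
```

With "library" and "repoA" as trusted registries and only "centos" white-listed, "centos:7" and "repoA/imageA" would be accepted while the local image "black" would be rejected.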
[jira] [Commented] (YARN-9118) Handle issues with parsing user defined GPU devices in GpuDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767351#comment-16767351 ] Peter Bacsko commented on YARN-9118: "If I put those method names into a newline, it looks really weird" Just use {{@SuppressWarnings("checkstyle:linelength")}} if it doesn't make sense. > Handle issues with parsing user defined GPU devices in GpuDiscoverer > > > Key: YARN-9118 > URL: https://issues.apache.org/jira/browse/YARN-9118 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9118.001.patch, YARN-9118.002.patch, > YARN-9118.003.patch, YARN-9118.004.patch, YARN-9118.005.patch, > YARN-9118.006.patch, YARN-9118.007.patch, YARN-9118.008.patch > > > getGpusUsableByYarn has the following issues: > - Duplicate GPU device definitions are not denied: This seems to be the > biggest issue as it could increase the number of devices on the node if the > device ID is defined 2 or more times. > - An empty-string is accepted, it works like the user would not want to use > auto-discovery and haven't defined any GPU devices: This will result in an > empty device list, but the empty-string check is never explicitly there in > the code, so this behavior just coincidental. > - Number validation does not happen on GPU device IDs (separated by commas) > Many testcases are added as the coverage was already very low. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9118) Handle issues with parsing user defined GPU devices in GpuDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767351#comment-16767351 ] Peter Bacsko edited comment on YARN-9118 at 2/13/19 4:03 PM: - "If I put those method names into a newline, it looks really weird" Just use {{@SuppressWarnings("checkstyle:linelength")}} if that's the case was (Author: pbacsko): "If I put those method names into a newline, it looks really weird" Just use {{@SuppressWarnings("checkstyle:linelength")}} if it doesn't make sense. > Handle issues with parsing user defined GPU devices in GpuDiscoverer > > > Key: YARN-9118 > URL: https://issues.apache.org/jira/browse/YARN-9118 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9118.001.patch, YARN-9118.002.patch, > YARN-9118.003.patch, YARN-9118.004.patch, YARN-9118.005.patch, > YARN-9118.006.patch, YARN-9118.007.patch, YARN-9118.008.patch > > > getGpusUsableByYarn has the following issues: > - Duplicate GPU device definitions are not denied: This seems to be the > biggest issue as it could increase the number of devices on the node if the > device ID is defined 2 or more times. > - An empty-string is accepted, it works like the user would not want to use > auto-discovery and haven't defined any GPU devices: This will result in an > empty device list, but the empty-string check is never explicitly there in > the code, so this behavior just coincidental. > - Number validation does not happen on GPU device IDs (separated by commas) > Many testcases are added as the coverage was already very low. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9118) Handle issues with parsing user defined GPU devices in GpuDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767300#comment-16767300 ] Szilard Nemeth commented on YARN-9118: -- Hi [~tangzhankun]! Fixed some of the checkstyle issues with patch008. Some of them do not make sense to me: - Missing package-info: Is this really required? - I had 2 "line is longer than 80 chars" issues in GpuDeviceSpecificationException: If I put those method names into a newline, it looks really weird. - 'conf' hides a field: Is there any value in renaming the parameter? Are you fine with not fixing the issues listed above? [~pbacsko]: Extracted the creation of the configuration objects into the method with the latest patch. > Handle issues with parsing user defined GPU devices in GpuDiscoverer > > > Key: YARN-9118 > URL: https://issues.apache.org/jira/browse/YARN-9118 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9118.001.patch, YARN-9118.002.patch, > YARN-9118.003.patch, YARN-9118.004.patch, YARN-9118.005.patch, > YARN-9118.006.patch, YARN-9118.007.patch, YARN-9118.008.patch > > > getGpusUsableByYarn has the following issues: > - Duplicate GPU device definitions are not denied: This seems to be the > biggest issue as it could increase the number of devices on the node if the > device ID is defined 2 or more times. > - An empty string is accepted, and it works as if the user did not want to use > auto-discovery and had not defined any GPU devices: This will result in an > empty device list, but the empty-string check is never explicitly there in > the code, so this behavior is just coincidental. > - Number validation does not happen on GPU device IDs (separated by commas) > Many testcases are added as the coverage was already very low. 
[jira] [Commented] (YARN-9098) Separate mtab file reader code and cgroups file system hierarchy parser code from CGroupsHandlerImpl and ResourceHandlerModule
[ https://issues.apache.org/jira/browse/YARN-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767298#comment-16767298 ] Peter Bacsko commented on YARN-9098: Maybe it's just nitpicking, but... {noformat} public List getPathsForController(String controller) { return mappings.entrySet().stream() .filter(e -> e.getValue().contains(controller)) .map(Map.Entry::getKey) .collect(Collectors.toList()); } {noformat} Is it ok to use {{contains()}} here? If cpu and cpuacct are mounted to two different directories, then we might return wrong path for cpu, no? Usually they're mounted to the same directory like {{/sys/fs/cgroup/cpu,cpuacct}} but it's something to think about. > Separate mtab file reader code and cgroups file system hierarchy parser code > from CGroupsHandlerImpl and ResourceHandlerModule > -- > > Key: YARN-9098 > URL: https://issues.apache.org/jira/browse/YARN-9098 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9098.002.patch, YARN-9098.003.patch, > YARN-9098.004.patch, YARN-9098.005.patch, YARN-9098.006.patch > > > Separate mtab file reader code and cgroups file system hierarchy parser code > from CGroupsHandlerImpl and ResourceHandlerModule > CGroupsHandlerImpl has a method parseMtab that parses an mtab file and stores > cgroups data. > CGroupsLCEResourcesHandler also has a method with the same name, with > identical code. > The parser code should be extracted from these places and be added in a new > class as this is a separate responsibility. > As the output of the file parser is a Map>, it's better > to encapsulate it in a domain object, named 'CGroupsMountConfig' for instance. > ResourceHandlerModule has a method named parseConfiguredCGroupPath, that is > responsible for producing the same results (Map>) to > store cgroups data, it does not operate on mtab file, but looking at the > filesystem for cgroup settings. 
As the output is the same, CGroupsMountConfig > should be used here, too. > Again, this code should not be part of ResourceHandlerModule as it is a > different responsibility. > One more thing which is strongly related to the methods above is > CGroupsHandlerImpl.initializeFromMountConfig: This method processes the > result of a parsed mtab file or parsed cgroups filesystem data and stores > file system paths for all available controllers. This method invokes > findControllerPathInMountConfig, which is duplicated in CGroupsHandlerImpl > and CGroupsLCEResourcesHandler, so it should be moved to a single place. To > store filesystem path and controller mappings, a new domain object could be > introduced.
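Editorial sketch for the {{contains()}} question in the YARN-9098 comment above: whether "cpu" can wrongly match "cpuacct" depends on what {{contains()}} is called on. The generic types of the map are lost in this archive, so the sketch assumes a mapping from mount path to a set of controller names. The class and the {{addMount}} and {{substringMatch}} helpers are hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for the mount configuration object.
final class ControllerPathLookupSketch {
  // Assumed shape: mount path -> controllers mounted at that path.
  private final Map<String, Set<String>> mappings = new LinkedHashMap<>();

  void addMount(String path, String... controllers) {
    mappings.put(path, new HashSet<>(Arrays.asList(controllers)));
  }

  // Same stream pipeline as in the comment. Set.contains() compares whole
  // elements, so "cpu" does not match a set that only holds "cpuacct".
  List<String> getPathsForController(String controller) {
    return mappings.entrySet().stream()
        .filter(e -> e.getValue().contains(controller))
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  // The risky variant: String.contains() is a substring test, so
  // "cpuacct".contains("cpu") is true.
  static boolean substringMatch(String controllers, String controller) {
    return controllers.contains(controller);
  }
}
```

If the values are sets, separate mounts for cpu and cpuacct resolve correctly. If they were plain strings such as "cpu,cpuacct", the substring behaviour is what would produce the wrong path.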
[jira] [Updated] (YARN-9118) Handle issues with parsing user defined GPU devices in GpuDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Szilard Nemeth updated YARN-9118: - Attachment: YARN-9118.008.patch > Handle issues with parsing user defined GPU devices in GpuDiscoverer > > > Key: YARN-9118 > URL: https://issues.apache.org/jira/browse/YARN-9118 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9118.001.patch, YARN-9118.002.patch, > YARN-9118.003.patch, YARN-9118.004.patch, YARN-9118.005.patch, > YARN-9118.006.patch, YARN-9118.007.patch, YARN-9118.008.patch > > > getGpusUsableByYarn has the following issues: > - Duplicate GPU device definitions are not denied: This seems to be the > biggest issue as it could increase the number of devices on the node if the > device ID is defined 2 or more times. > - An empty-string is accepted, it works like the user would not want to use > auto-discovery and haven't defined any GPU devices: This will result in an > empty device list, but the empty-string check is never explicitly there in > the code, so this behavior just coincidental. > - Number validation does not happen on GPU device IDs (separated by commas) > Many testcases are added as the coverage was already very low. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
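Editorial sketch for the three problems listed in the YARN-9118 description above (duplicates, empty string, missing number validation): a stricter parser could look roughly like this. It is not the code from the attached patches. The class name is hypothetical, and the "index:minorNumber" entry format is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical parser for a comma-separated list of GPU devices.
final class GpuDeviceSpecSketch {

  // Returns {index, minorNumber} pairs; throws on any malformed input.
  static List<int[]> parse(String spec) {
    // Explicit empty-string check instead of relying on coincidence.
    if (spec == null || spec.trim().isEmpty()) {
      throw new IllegalArgumentException("Empty GPU device specification");
    }
    List<int[]> devices = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    for (String raw : spec.split(",")) {
      String entry = raw.trim();
      String[] parts = entry.split(":");
      if (parts.length != 2) {
        throw new IllegalArgumentException("Malformed GPU device: " + entry);
      }
      int index;
      int minor;
      try {
        // Number validation on both parts of the device ID.
        index = Integer.parseInt(parts[0].trim());
        minor = Integer.parseInt(parts[1].trim());
      } catch (NumberFormatException e) {
        throw new IllegalArgumentException("Non-numeric GPU device: " + entry, e);
      }
      if (index < 0 || minor < 0) {
        throw new IllegalArgumentException("Negative GPU device ID: " + entry);
      }
      // Deny duplicates so the device count cannot be inflated.
      if (!seen.add(index + ":" + minor)) {
        throw new IllegalArgumentException("Duplicate GPU device: " + entry);
      }
      devices.add(new int[] {index, minor});
    }
    return devices;
  }
}
```

Failing fast on each case makes the empty-string behaviour an explicit decision, which is the point raised in the second bullet of the description.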
[jira] [Commented] (YARN-9123) Clean up and split testcases in TestNMWebServices for GPU support
[ https://issues.apache.org/jira/browse/YARN-9123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767291#comment-16767291 ] Peter Bacsko commented on YARN-9123: " testGetNMResourceInfoFailBecauseOfUnknownPlugin is a bit lengthy: 47 character." I think this is fine (seen much worse). Another name could be sth like {{testGetNMResourceInfoWhenPluginIsUnknown}} which is also a popular naming scheme (I mean using "when"). Talking about repetitions, this could be extracted too: {noformat} ClientResponse response = r.path("ws").path("v1").path("node").path( "resources").path("resource-2").accept(MediaType.APPLICATION_JSON).get( ClientResponse.class); {noformat} > Clean up and split testcases in TestNMWebServices for GPU support > - > > Key: YARN-9123 > URL: https://issues.apache.org/jira/browse/YARN-9123 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Minor > Attachments: YARN-9123.001.patch, YARN-9123.002.patch, > YARN-9123.003.patch, YARN-9123.004.patch > > > The following testcases can be cleaned up a bit: > TestNMWebServices#testGetNMResourceInfo - Can be split up to 3 different cases > TestNMWebServices#testGetYarnGpuResourceInfo -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9135) NM State store ResourceMappings serialization are tested with Strings instead of real Device objects
[ https://issues.apache.org/jira/browse/YARN-9135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767287#comment-16767287 ] Peter Bacsko commented on YARN-9135: Thanks for updating the patch [~snemeth]. Please make sure that these methods return a standard {{Map}} instead of {{ImmutableMap}} (the more generic the better). {{public ImmutableMap getNodeVsCpus()}} {{public ImmutableMap getNodeVsCpus()}} > NM State store ResourceMappings serialization are tested with Strings instead > of real Device objects > > > Key: YARN-9135 > URL: https://issues.apache.org/jira/browse/YARN-9135 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9135.001.patch, YARN-9135.003.patch, > YARN-9135.004.patch > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
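Editorial sketch for the review remark in the YARN-9135 comment above: a getter can be declared with the general {{Map}} interface and still hand out a read-only view. The type parameters are lost in this archive, so {{Map<String, Integer>}} and the class name are assumptions, and the JDK's {{Collections.unmodifiableMap}} is used here in place of Guava's {{ImmutableMap}}.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical holder class; only the getter signature matters here.
final class NodeResourceViewSketch {
  private final Map<String, Integer> nodeVsCpus;

  NodeResourceViewSketch(Map<String, Integer> source) {
    // Defensive copy, then wrap so callers cannot modify internal state.
    this.nodeVsCpus = Collections.unmodifiableMap(new HashMap<>(source));
  }

  // Declared as Map: callers depend only on the interface, while the
  // returned instance remains unmodifiable.
  Map<String, Integer> getNodeVsCpus() {
    return nodeVsCpus;
  }
}
```

Callers can read the map as usual, and any attempt to modify it throws UnsupportedOperationException.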
[jira] [Commented] (YARN-9133) Make tests more easy to comprehend in TestGpuResourceHandler
[ https://issues.apache.org/jira/browse/YARN-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767279#comment-16767279 ] Peter Bacsko commented on YARN-9133: +1 (non-binding) > Make tests more easy to comprehend in TestGpuResourceHandler > > > Key: YARN-9133 > URL: https://issues.apache.org/jira/browse/YARN-9133 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9133.001.patch, YARN-9133.001.patch, > YARN-9133.002.patch, YARN-9133.003.patch, YARN-9133.004.patch, > YARN-9133.005.patch > > > Tests are not quite easy to read: > - Some more helper methods would improve readability. > - Eliminating the boolean flag that controls if docker is used would also > improve readability and clarity. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9138) Test error handling of nvidia-smi binary execution of GpuDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767263#comment-16767263 ] Peter Bacsko commented on YARN-9138: [~snemeth] now you can remove this unnecessary code-paths: {noformat} if (Shell.WINDOWS) { ... } else { ... {noformat} > Test error handling of nvidia-smi binary execution of GpuDiscoverer > --- > > Key: YARN-9138 > URL: https://issues.apache.org/jira/browse/YARN-9138 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9138.001.patch, YARN-9138.002.patch > > > The code that executes nvidia-smi (doing GPU device auto-discovery) don't > have much test coverage. > This patch adds tests to this part of the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9138) Test error handling of nvidia-smi binary execution of GpuDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767263#comment-16767263 ] Peter Bacsko edited comment on YARN-9138 at 2/13/19 2:47 PM: - [~snemeth] 1. now you can remove these unnecessary code-paths: {noformat} if (Shell.WINDOWS) { ... } else { ... {noformat} 2. OK, I know this is annoying, but could you static import assert calls? We use it everywhere else, so let's be consistent. 3. String "PATH" is used multiple times, it's worth making it static final. Same applies to "u+x". was (Author: pbacsko): [~snemeth] now you can remove these unnecessary code-paths: {noformat} if (Shell.WINDOWS) { ... } else { ... {noformat} > Test error handling of nvidia-smi binary execution of GpuDiscoverer > --- > > Key: YARN-9138 > URL: https://issues.apache.org/jira/browse/YARN-9138 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9138.001.patch, YARN-9138.002.patch > > > The code that executes nvidia-smi (doing GPU device auto-discovery) don't > have much test coverage. > This patch adds tests to this part of the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-9138) Test error handling of nvidia-smi binary execution of GpuDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767263#comment-16767263 ] Peter Bacsko edited comment on YARN-9138 at 2/13/19 2:38 PM: - [~snemeth] now you can remove these unnecessary code-paths: {noformat} if (Shell.WINDOWS) { ... } else { ... {noformat} was (Author: pbacsko): [~snemeth] now you can remove this unnecessary code-paths: {noformat} if (Shell.WINDOWS) { ... } else { ... {noformat} > Test error handling of nvidia-smi binary execution of GpuDiscoverer > --- > > Key: YARN-9138 > URL: https://issues.apache.org/jira/browse/YARN-9138 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9138.001.patch, YARN-9138.002.patch > > > The code that executes nvidia-smi (doing GPU device auto-discovery) don't > have much test coverage. > This patch adds tests to this part of the code. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9139) Simplify initializer code of GpuDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767257#comment-16767257 ] Peter Bacsko commented on YARN-9139: [~snemeth] 1. Please fix the remaining checkstyle issues. 2. Why is the {{TestFpgaDiscoverer}} class referenced in {{TestGpuResourceHandler.java}}? 3. Repeated use of {{Configuration conf = createDefaultConfig();}} - extract {{conf}} to a class variable and initialize it once. > Simplify initializer code of GpuDiscoverer > -- > > Key: YARN-9139 > URL: https://issues.apache.org/jira/browse/YARN-9139 > Project: Hadoop YARN > Issue Type: Improvement >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9139.001.patch, YARN-9139.002.patch, > YARN-9139.003.patch > >
[jira] [Commented] (YARN-8295) [UI2] The "Resource Usage" tab is pointless for finished applications
[ https://issues.apache.org/jira/browse/YARN-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767246#comment-16767246 ] Hadoop QA commented on YARN-8295: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 33s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 28m 35s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 11s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 58s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 26s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black} 42m 11s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-8295 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12958565/YARN-8295.001.patch | | Optional Tests | dupname asflicense shadedclient | | uname | Linux c9389175cbbe 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 00c5ffa | | maven | version: Apache Maven 3.3.9 | | Max. process+thread count | 445 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui | | Console output | https://builds.apache.org/job/PreCommit-YARN-Build/23396/console | | Powered by | Apache Yetus 0.8.0 http://yetus.apache.org | This message was automatically generated. > [UI2] The "Resource Usage" tab is pointless for finished applications > - > > Key: YARN-8295 > URL: https://issues.apache.org/jira/browse/YARN-8295 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Reporter: Gergely Novák >Assignee: Charan Hebri >Priority: Minor > Attachments: YARN-8295.001.patch > > > If the user goes to Applications -> app -> Resource Usage for a finished > application, they get this message: "No resource usage data is available for > this application!". > I think it would be better to hide this tab for finished applications, or at > least add something like "this application is not using any resources because > it is finished" to the message. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9118) Handle issues with parsing user defined GPU devices in GpuDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767226#comment-16767226 ] Peter Bacsko commented on YARN-9118: Minor: {{Configuration conf = new Configuration(false);}} - this line keeps repeating in the tests. How about making {{conf}} a class variable and instantiating it in {{setup()}}? Otherwise +1 non-binding. > Handle issues with parsing user defined GPU devices in GpuDiscoverer > > > Key: YARN-9118 > URL: https://issues.apache.org/jira/browse/YARN-9118 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Szilard Nemeth >Assignee: Szilard Nemeth >Priority: Major > Attachments: YARN-9118.001.patch, YARN-9118.002.patch, > YARN-9118.003.patch, YARN-9118.004.patch, YARN-9118.005.patch, > YARN-9118.006.patch, YARN-9118.007.patch > > > getGpusUsableByYarn has the following issues: > - Duplicate GPU device definitions are not denied: This seems to be the > biggest issue as it could increase the number of devices on the node if the > device ID is defined 2 or more times. > - An empty-string is accepted, it works like the user would not want to use > auto-discovery and haven't defined any GPU devices: This will result in an > empty device list, but the empty-string check is never explicitly there in > the code, so this behavior just coincidental. > - Number validation does not happen on GPU device IDs (separated by commas) > Many testcases are added as the coverage was already very low. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-9270) Minor cleanup in TestFpgaDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767222#comment-16767222 ] Hadoop QA commented on YARN-9270: - | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 18s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 7s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 29s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 9s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 23s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: The patch generated 5 new + 143 unchanged - 12 fixed = 148 total (was 155) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 43s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 7s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 20m 50s{color} | {color:green} hadoop-yarn-server-nodemanager in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 32s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 73m 28s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:8f97d6f | | JIRA Issue | YARN-9270 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12958563/YARN-9270-002.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux eb7c1cfa7a5e 4.4.0-138-generic #164~14.04.1-Ubuntu SMP Fri Oct 5 08:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 00c5ffa | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_191 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/23395/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt | | Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/23395/testReport/ | | Max. process+thread count | 339 (vs. ulimit of 1) | | modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server
[jira] [Commented] (YARN-9217) Nodemanager will fail to start if GPU is misconfigured on the node or GPU drivers missing
[ https://issues.apache.org/jira/browse/YARN-9217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767203#comment-16767203 ] Peter Bacsko commented on YARN-9217: Minor comments: 1. Do we need a separate variable here? {noformat} if (usableGpus.isEmpty()) { String message = "GPU is enabled on the NodeManager, but couldn't find " + "any usable GPU devices, please double check configuration."; LOG.warn(message); {noformat} 2. Similar thing in GpuNodeResourceUpdateHandler {noformat} if (usableGpus.isEmpty()) { String message = "GPU is enabled, but couldn't find any usable GPUs on the " + "NodeManager."; LOG.warn(message); {noformat} 3. I would rename {{checkErrorNumber()}} to {{checkErrorCount()}} 4. By the way -- is it reasonable to perform GPU discovery in a loop? What's the idea here? Is "nvidia-smi" flaky sometimes? What condition are we trying to avoid? I realize that this part of the code existed before, but still... anyone? :) 5. {{NvidiaBinaryHelper}} - the {{@return}} clause is missing in the JavaDoc 6. {{NvidiaBinaryHelper}} - this class is very small. If it's introduced for testing purposes, I strongly recommend using a replaceable lambda function, like this: {noformat} Function> gpuDeviceRetriever = this::getGpuDeviceInformation; ... @VisibleForTesting void setGpuDeviceRetriever(Function> func) { this.gpuDeviceRetriever = func; } ... lastDiscoveredGpuInformation = gpuDeviceRetriever.apply(pathOfGpuBinary); {noformat} Then you can set your own retrieving logic in the test. Lambdas used as a {{Function}} can't throw checked exceptions, so you have to wrap incorrect return values in {{Optional}}. *Fundamental question*: is this how we want to use this plugin? Just asking because we might accidentally mask erratic behavior. E.g. a Hadoop user might think that he has a cluster with 10 GPUs. In reality, the plugin failed to detect some cards, and only 5 NMs support GPU scheduling. 
If it's not explicitly displayed, the user might be under the impression that 10 GPUs are ready to run YARN workloads. This can be very misleading. At the very least, a fail-fast method should be considered. > Nodemanager will fail to start if GPU is misconfigured on the node or GPU > drivers missing > - > > Key: YARN-9217 > URL: https://issues.apache.org/jira/browse/YARN-9217 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.0.0, 3.1.0 >Reporter: Antal Bálint Steinbach >Assignee: Antal Bálint Steinbach >Priority: Major > Attachments: YARN-9217.001.patch, YARN-9217.002.patch, > YARN-9217.003.patch, YARN-9217.004.patch > > > Nodemanager will not start > 1. If Autodiscovery is enabled: > * If the nvidia-smi path is misconfigured or the file does not exist > * If 0 GPUs are found > * If the file exists but it is not pointing to an nvidia-smi > * If the binary is OK but there is an IOException > 2. If the manually configured GPU devices are misconfigured > * Any index:minor number format failure will cause a problem > * 0 configured devices will cause a problem > * NumberFormatException is not handled > It would be a better option to add warnings about the configuration, set 0 > available GPUs and let the node work and run non-GPU jobs.
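The replaceable-function suggestion in point 6 of the comment above could look roughly like the following. It is a minimal sketch under stated assumptions: the generic types were lost from the quoted snippet, so {{Function<String, Optional<List<String>>>}} is a guess made for illustration, and {{GpuDiscovererSketch}} with its placeholder retriever is hypothetical rather than the class in the patch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical stand-in for the discoverer; names and types are illustrative.
class GpuDiscovererSketch {
  // Default retriever. The real one would invoke the nvidia-smi binary.
  private Function<String, Optional<List<String>>> gpuDeviceRetriever =
      this::getGpuDeviceInformation;

  // Placeholder: this sketch never executes a binary, so it reports "nothing found".
  private Optional<List<String>> getGpuDeviceInformation(String pathOfGpuBinary) {
    return Optional.empty();
  }

  // Test hook: lets a test substitute its own retrieval logic.
  void setGpuDeviceRetriever(Function<String, Optional<List<String>>> func) {
    this.gpuDeviceRetriever = func;
  }

  // A failed retrieval is mapped to an empty device list instead of an exception.
  List<String> discover(String pathOfGpuBinary) {
    return gpuDeviceRetriever.apply(pathOfGpuBinary).orElse(Collections.emptyList());
  }

  public static void main(String[] args) {
    GpuDiscovererSketch discoverer = new GpuDiscovererSketch();
    discoverer.setGpuDeviceRetriever(path -> Optional.of(Arrays.asList("0:0", "1:1")));
    System.out.println(discoverer.discover("/hypothetical/path/nvidia-smi"));
  }
}
```

Returning an empty list on failure is exactly the masking risk raised in the "fundamental question"; a fail-fast variant would throw when the {{Optional}} is empty.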
[jira] [Updated] (YARN-8295) [UI2] The "Resource Usage" tab is pointless for finished applications
[ https://issues.apache.org/jira/browse/YARN-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charan Hebri updated YARN-8295: --- Attachment: YARN-8295.001.patch > [UI2] The "Resource Usage" tab is pointless for finished applications > - > > Key: YARN-8295 > URL: https://issues.apache.org/jira/browse/YARN-8295 > Project: Hadoop YARN > Issue Type: Improvement > Components: yarn-ui-v2 >Reporter: Gergely Novák >Assignee: Charan Hebri >Priority: Minor > Attachments: YARN-8295.001.patch > > > If the user goes to Applications -> app -> Resource Usage for a finished > application, they get this message: "No resource usage data is available for > this application!". > I think it would be better to hide this tab for finished applications, or at > least add something like "this application is not using any resources because > it is finished" to the message.
[jira] [Commented] (YARN-1655) Add implementations to FairScheduler to support increase/decrease container resource
[ https://issues.apache.org/jira/browse/YARN-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767164#comment-16767164 ] Wilfred Spiegelenburg commented on YARN-1655: - The JUnit test failures are not related to this change. [~asuresh] could you please review this, since you did the unifying code work? > Add implementations to FairScheduler to support increase/decrease container > resource > > > Key: YARN-1655 > URL: https://issues.apache.org/jira/browse/YARN-1655 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager, scheduler >Reporter: Wangda Tan >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-1655.001.patch, YARN-1655.002.patch, > YARN-1655.003.patch
[jira] [Commented] (YARN-9270) Minor cleanup in TestFpgaDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767153#comment-16767153 ] Peter Bacsko commented on YARN-9270: Uploaded v2. Changes: * FpgaDiscoverer is no longer a singleton * Removed unnecessary synchronized methods (checked the call hierarchy) "We request the instance of the FpgaDiscoverer 5 times, and then call the setResourceHanderPlugin on it with the same parameter (openclPlugin)" This is no longer relevant now. "Also could you move the previous comments/description of the test cases to the new tests' javadoc?" Removed those altogether. Tests are short now, so it should be obvious what they do. > Minor cleanup in TestFpgaDiscoverer > --- > > Key: YARN-9270 > URL: https://issues.apache.org/jira/browse/YARN-9270 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-9270-001.patch, YARN-9270-002.patch > > > Let's do some cleanup in this class. > * {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split > up into 5 different tests, because it tests 5 different scenarios. > * remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a > {{Function}} in the plugin class like {{Function envProvider > = System::getenv}} plus a setter method which allows the test to modify > {{envProvider}}. Much simpler and more straightforward.
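The {{envProvider}} idea from the YARN-9270 description can be sketched as follows. The generic types are missing from the quoted text, so {{Function<String, String>}} is an assumption (it matches {{System.getenv(String)}}); the class {{FpgaPluginSketch}}, the method {{findSdkRoot()}} and the variable name {{EXAMPLE_SDK_ROOT}} are hypothetical and used only for illustration.

```java
import java.util.function.Function;

// Hypothetical plugin class; only the envProvider idea comes from the issue text.
class FpgaPluginSketch {
  // Hypothetical environment variable name, chosen for this sketch.
  static final String SDK_ROOT_VAR = "EXAMPLE_SDK_ROOT";

  // By default, look variables up in the real process environment.
  private Function<String, String> envProvider = System::getenv;

  // Setter that lets a test supply a fake environment.
  void setEnvProvider(Function<String, String> envProvider) {
    this.envProvider = envProvider;
  }

  // Returns the configured SDK root, or null if the variable is not set.
  String findSdkRoot() {
    return envProvider.apply(SDK_ROOT_VAR);
  }

  public static void main(String[] args) {
    FpgaPluginSketch plugin = new FpgaPluginSketch();
    plugin.setEnvProvider(name -> SDK_ROOT_VAR.equals(name) ? "/fake/sdk" : null);
    System.out.println(plugin.findSdkRoot());
  }
}
```

A test then needs no reflection tricks to alter the process environment: it passes a lambda and asserts on the result.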
[jira] [Commented] (YARN-9298) Implement FS placement rules using PlacementRule interface
[ https://issues.apache.org/jira/browse/YARN-9298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767167#comment-16767167 ] Wilfred Spiegelenburg commented on YARN-9298: - The JUnit test failure seems unrelated. The absence of new tests is correct: those will follow with the integration into the scheduler. > Implement FS placement rules using PlacementRule interface > -- > > Key: YARN-9298 > URL: https://issues.apache.org/jira/browse/YARN-9298 > Project: Hadoop YARN > Issue Type: Improvement > Components: scheduler >Reporter: Wilfred Spiegelenburg >Assignee: Wilfred Spiegelenburg >Priority: Major > Attachments: YARN-9298.001.patch > > > Implement existing placement rules of the FS using the PlacementRule > interface. > Preparation for YARN-8967
[jira] [Updated] (YARN-9268) Various fixes are needed in FpgaDevice
[ https://issues.apache.org/jira/browse/YARN-9268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9268: --- Description: Need to fix the following in the class {{FpgaDevice}}: * It implements {{Comparable}}, but returns 0 in every case. There is no natural ordering among FPGA devices; perhaps "acl0" comes before "acl1", but this seems too forced and unnecessary. We think this class should not implement {{Comparable}} at all, at least not like that. * Stores unnecessary fields: devName, busNum, temperature, power usage. For one, these are never needed in the code. Secondly, temperature and power usage change constantly. It's pointless to store these in this POJO. * {{serialVersionUID}} is 1L - let's generate a number for this * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor uniquely identify the card, then let's demand them in the constructor and don't store Integers that can be null. was: Need to fix the following the class {{FpgaDevice}}: * It implements {{Comparable}}, but returns 0 in every case. There is no natural ordering among FPGA devices; perhaps "acl0" comes before "acl1", but this seems too forced and unnecessary. We think this class should not implement {{Comparable}} at all, at least not like that. * Stores unnecessary fields: devName, busNum, temperature, power usage. For one, these are never needed in the code. Secondly, temperature and power usage change constantly. It's pointless to store these in this POJO. * {{serialVersionUID}} is 1L - let's generate a number for this * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor uniquely identify the card, then let's demand them in the constructor and don't store Integers that can be null. 
> Various fixes are needed in FpgaDevice > -- > > Key: YARN-9268 > URL: https://issues.apache.org/jira/browse/YARN-9268 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-9268-001.patch, YARN-9268-002.patch, > YARN-9268-003.patch > > > Need to fix the following in the class {{FpgaDevice}}: > * It implements {{Comparable}}, but returns 0 in every case. There is no > natural ordering among FPGA devices; perhaps "acl0" comes before "acl1", but > this seems too forced and unnecessary. We think this class should not > implement {{Comparable}} at all, at least not like that. > * Stores unnecessary fields: devName, busNum, temperature, power usage. For > one, these are never needed in the code. Secondly, temperature and power usage > change constantly. It's pointless to store these in this POJO. > * {{serialVersionUID}} is 1L - let's generate a number for this > * Use {{int}} instead of {{Integer}} - don't allow nulls. If major/minor > uniquely identify the card, then let's demand them in the constructor and > don't store Integers that can be null.
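A device class shaped along the lines of the YARN-9268 description might look like the sketch below. It is not the patched {{FpgaDevice}}: the class name {{FpgaDeviceSketch}} is hypothetical, the {{serialVersionUID}} value is an arbitrary illustrative number, and identity is assumed to be the major/minor pair, as the description suggests it may be.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical sketch; not the FpgaDevice class from the patch.
final class FpgaDeviceSketch implements Serializable {
  // Arbitrary illustrative value standing in for a generated ID.
  private static final long serialVersionUID = 7105421856987034219L;

  // Primitive ints: both values are required, so null is impossible.
  private final int major;
  private final int minor;

  FpgaDeviceSketch(int major, int minor) {
    this.major = major;
    this.minor = minor;
  }

  int getMajor() {
    return major;
  }

  int getMinor() {
    return minor;
  }

  // Equality is based on major/minor only; no Comparable, since there is
  // no natural ordering among devices.
  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (!(obj instanceof FpgaDeviceSketch)) {
      return false;
    }
    FpgaDeviceSketch other = (FpgaDeviceSketch) obj;
    return major == other.major && minor == other.minor;
  }

  @Override
  public int hashCode() {
    return Objects.hash(major, minor);
  }

  @Override
  public String toString() {
    return "FpgaDeviceSketch[major=" + major + ", minor=" + minor + "]";
  }

  public static void main(String[] args) {
    System.out.println(new FpgaDeviceSketch(246, 0));
  }
}
```

Volatile readings such as temperature and power usage are left out on purpose; they would be queried when needed rather than stored in the object.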
[jira] [Updated] (YARN-9270) Minor cleanup in TestFpgaDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Bacsko updated YARN-9270: --- Attachment: YARN-9270-002.patch > Minor cleanup in TestFpgaDiscoverer > --- > > Key: YARN-9270 > URL: https://issues.apache.org/jira/browse/YARN-9270 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-9270-001.patch, YARN-9270-002.patch > > > Let's do some cleanup in this class. > * {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split > up into 5 different tests, because it tests 5 different scenarios. > * remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a > {{Function}} in the plugin class like {{Function envProvider > = System::getenv}} plus a setter method which allows the test to modify > {{envProvider}}. Much simpler and more straightforward.
[jira] [Commented] (YARN-9270) Minor cleanup in TestFpgaDiscoverer
[ https://issues.apache.org/jira/browse/YARN-9270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16767085#comment-16767085 ] Peter Bacsko commented on YARN-9270: "could we remove the wildcard import import java.util.*." Certainly, let's do this in YARN-9266. "don't see why the constructor of Configuration is called with false" [...] "Also the 5th testcase (testLinuxFpgaResourceDiscoverPluginWithSdkRootSet) uses another Configuration object in the original testcase" I think the idea here is that the original conf object was created with "false" so that it doesn't load the default values, but in that particular test (5th), we do. I see no significant difference though. Just tried it, and the test result is the same. I'm also thinking about making {{FpgaDiscoverer}} non-singleton. It's much easier to test that way. > Minor cleanup in TestFpgaDiscoverer > --- > > Key: YARN-9270 > URL: https://issues.apache.org/jira/browse/YARN-9270 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Peter Bacsko >Assignee: Peter Bacsko >Priority: Major > Attachments: YARN-9270-001.patch > > > Let's do some cleanup in this class. > * {{testLinuxFpgaResourceDiscoverPluginConfig}} - this test should be split > up into 5 different tests, because it tests 5 different scenarios. > * remove {{setNewEnvironmentHack()}} - too complicated. We can introduce a > {{Function}} in the plugin class like {{Function envProvider > = System::getenv}} plus a setter method which allows the test to modify > {{envProvider}}. Much simpler and more straightforward.
[jira] [Assigned] (YARN-7977) Do ACLs check for flow activity entities
[ https://issues.apache.org/jira/browse/YARN-7977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Modi reassigned YARN-7977: --- Assignee: Abhishek Modi > Do ACLs check for flow activity entities > > > Key: YARN-7977 > URL: https://issues.apache.org/jira/browse/YARN-7977 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelinereader >Reporter: Rohith Sharma K S >Assignee: Abhishek Modi >Priority: Major > > Verify ACLs while retrieving flow activity entities
[jira] [Assigned] (YARN-7979) Do ACLs check for application entities
[ https://issues.apache.org/jira/browse/YARN-7979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Modi reassigned YARN-7979: --- Assignee: Abhishek Modi > Do ACLs check for application entities > -- > > Key: YARN-7979 > URL: https://issues.apache.org/jira/browse/YARN-7979 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelinereader >Reporter: Rohith Sharma K S >Assignee: Abhishek Modi >Priority: Major > > Verify ACLs for application entities
[jira] [Assigned] (YARN-7981) Do ACLs check for sub app entities
[ https://issues.apache.org/jira/browse/YARN-7981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Modi reassigned YARN-7981: --- Assignee: Abhishek Modi > Do ACLs check for sub app entities > -- > > Key: YARN-7981 > URL: https://issues.apache.org/jira/browse/YARN-7981 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelinereader >Reporter: Rohith Sharma K S >Assignee: Abhishek Modi >Priority: Major > > ACLs check while retrieving sub app entities.
[jira] [Assigned] (YARN-5357) Timeline service v2 integration with Federation
[ https://issues.apache.org/jira/browse/YARN-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Modi reassigned YARN-5357: --- Assignee: Abhishek Modi (was: Prabha Manepalli) > Timeline service v2 integration with Federation > > > Key: YARN-5357 > URL: https://issues.apache.org/jira/browse/YARN-5357 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelineserver >Reporter: Vrushali C >Assignee: Abhishek Modi >Priority: Major > > Jira to note the discussion points from an initial chat about integrating > Timeline Service v2 with Federation (YARN-2915). > cc [~subru] [~curino] > For Federation: > - all entities that belong to the same flow run should have the same cluster > name > - app id in the same flow run strongly ordered in time > - need a logical cluster name and physical cluster name > - a possibility to implement the Application TimelineCollector as an > interceptor in the AMRMProxyService. > For Timeline Service: > - need to store physical cluster id and logical cluster id so that we don't > lose information at any level (flow/app/entity etc) > - add a new table app id to cluster mapping table > - need a different entity table/some table to store node level metrics for > physical cluster stats. Once we get to node-level rollup, we probably have to > store something in a dc, cluster, rack, node hierarchy. In that case a > physical cluster makes sense, but we'd still need some way to tie physical > and logical together in order to make automatic error detection etc that > we're envisioning feasible within a federated setup. > For the Cluster Naming convention: > - three situations for cluster name: > > app submitted to router should take federated (aka logical) cluster name > > app submitted directly to RM should take physical cluster name > > Info about the physical cluster in entities? 
> - suggestion to set the cluster name as yarn tag at the router level (in the > app submission context) > Other points to note: > - for federation to work smoothly in environments that use HDFS some > additional considerations are needed, and possibly some solution like what is > being used at Twitter with the nFly approach. > Email thread context: > {code} > -- Forwarded message -- > From: Joep Rottinghuis > Date: Fri, Jul 8, 2016 at 1:22 PM > Subject: Re: Federation -Timeline Service meeting notes > To: Subramaniam Venkatraman Krishnan > Cc: Sangjin Lee, Vrushali Channapattan, Carlo Curino > Thanks for the notes. > I think that for federation to work smoothly in environments that use HDFS > some additional considerations are needed, and possibly some solution like > what we're using at Twitter with our nFly approach. > bq. - need a different entity table/some table to store node level metrics > for physical cluster stats > Once we get to node-level rollup, we probably have to store something in a > dc, cluster, rack, node hierarchy. In that case a physical cluster makes > sense, but we'd still need some way to tie physical and logical together in > order to make automatic error detection etc that we're envisioning feasible > within a federated setup. > Cheers, > Joep > On Fri, Jul 8, 2016 at 1:00 PM, Subramaniam Venkatraman Krishnan wrote: > Thanks Vrushali for crisply capturing the essential from our rambling > discussion :). > > Sangjin, I just want to add one comment to yours – we want to retain the > physical cluster name (possibly as a new entity type) so that we don’t lose > information & we can cluster level rollups even if they are not efficient. > > Additionally, based on the walkthrough of Federation design: > · There was general agreement with the proposed approach. > · There is a possibility to implement the Application > TimelineCollector as an interceptor in the AMRMProxyService. 
> · Joep raised the concern that it would be better if the RMs > obtain the epoch from FederationStateStore. This is not currently in the > roadmap of our MVP but we definitely plan to address this in future. > > Regards, > Subru > > From: Sangjin Lee > Sent: Thursday, July 07, 2016 6:22 PM > To: Vrushali Channapattan > Cc: Joep Rottinghuis; Carlo Curino; Subramaniam Venkatraman Krishnan > Subject: Re: Federation -Timeline Service meeting notes > > Thanks for the summary Vrushali! > > Just so that we're on the same page regarding the terminology, I > understand we're using the terms "logical cluster" and "federated cluster" > interchangeably. > > Also, between using the federated cluster name and the home cluster name > as a solution, I think we were leaning towards the federated cluster name > (al
[jira] [Assigned] (YARN-7978) Do ACLs check for flowrun entities
[ https://issues.apache.org/jira/browse/YARN-7978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Modi reassigned YARN-7978: --- Assignee: Abhishek Modi > Do ACLs check for flowrun entities > -- > > Key: YARN-7978 > URL: https://issues.apache.org/jira/browse/YARN-7978 > Project: Hadoop YARN > Issue Type: Sub-task > Components: timelinereader >Reporter: Rohith Sharma K S >Assignee: Abhishek Modi >Priority: Major > > Verify ACLs while retrieving flowrun entities
[jira] [Assigned] (YARN-2499) Respect labels in preemption policy of fair scheduler
[ https://issues.apache.org/jira/browse/YARN-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhaohui Xin reassigned YARN-2499: - Assignee: Zhaohui Xin > Respect labels in preemption policy of fair scheduler > - > > Key: YARN-2499 > URL: https://issues.apache.org/jira/browse/YARN-2499 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Reporter: Wangda Tan >Assignee: Zhaohui Xin >Priority: Major
[jira] [Commented] (YARN-9294) Potential race condition in setting GPU cgroups & execute command in the selected cgroup
[ https://issues.apache.org/jira/browse/YARN-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766901#comment-16766901 ] Zhankun Tang commented on YARN-9294: [~oliverhuh...@gmail.com], good job! Looking forward to your patch. > Potential race condition in setting GPU cgroups & execute command in the > selected cgroup > > > Key: YARN-9294 > URL: https://issues.apache.org/jira/browse/YARN-9294 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 2.10.0 >Reporter: Keqiu Hu >Assignee: Keqiu Hu >Priority: Critical > > Environment is latest branch-2 head > OS: RHEL 7.4 > *Observation* > Out of ~10 container allocations with a GPU requirement, at least 1 of the > allocated containers would lose GPU isolation. Even if I asked for 1 GPU, I > could still see all GPUs on the same machine when running > nvidia-smi. > The funny thing is that although I can see all GPUs at the moment of > executing container-executor (say ordinals 0,1,2,3), cgroups restricts the > process's access to only that single GPU after some time. > The underlying process trying to access the GPU takes the initial > information as the source of truth and tries to access physical GPU 0, which is not > really available to the process. This results in a > [CUDA_ERROR_INVALID_DEVICE: invalid device ordinal] error. 
> Validated the container-executor commands are correct: > {code:java} > PrivilegedOperationExecutor command: > [/export/apps/hadoop/nodemanager/latest/bin/container-executor, --module-gpu, > --container_id, container_e22_1549663278916_0249_01_01, --excluded_gpus, > 0,1,2,3] > PrivilegedOperationExecutor command: > [/export/apps/hadoop/nodemanager/latest/bin/container-executor, khu, khu, 0, > application_1549663278916_0249, > /grid/a/tmp/yarn/nmPrivate/container_e22_1549663278916_0249_01_01.tokens, > /grid/a/tmp/yarn, /grid/a/tmp/userlogs, > /export/apps/jdk/JDK-1_8_0_172/jre/bin/java, -classpath, ..., -Xmx256m, > org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer, > khu, application_1549663278916_0249, > container_e22_1549663278916_0249_01_01, ltx1-hcl7552.grid.linkedin.com, > 8040, /grid/a/tmp/yarn] > {code} > So most likely a race condition between these two operations? > cc [~jhung] > Another potential theory is the cgroups creation for the container actually > failed but the error was swallowed silently. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org