[ https://issues.apache.org/jira/browse/YARN-8423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520682#comment-16520682 ]
Sunil Govindan commented on YARN-8423:
--------------------------------------

Thanks [~vinodkv]

Attaching a new patch after addressing all comments.
For a generic way of handling this, I have opened YARN-8450, as it needs to be analyzed and refactored at the NM level for all similar resources that may tend to block when a container is released. We will continue discussing that in YARN-8450 for a global approach; meanwhile, this issue can immediately tackle the GPU problem.

Thank you.

cc [~leftnoteasy]

> GPU does not get released even though the application gets killed.
> ------------------------------------------------------------------
>
>                 Key: YARN-8423
>                 URL: https://issues.apache.org/jira/browse/YARN-8423
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Sumana Sathish
>            Assignee: Sunil Govindan
>            Priority: Critical
>         Attachments: YARN-8423.001.patch, YARN-8423.002.patch,
> kill-container-nm.log
>
>
> Run a TensorFlow app requesting one GPU.
> Kill the application once the GPU is allocated.
> Query the NodeManager once the application is killed. We see that the GPU is not
> being released.
> {code}
> curl -i <NM>/ws/v1/node/resources/yarn.io%2Fgpu
> {"gpuDeviceInformation":{"gpus":[{"productName":"<productName>","uuid":"GPU-<UID>","minorNumber":0,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}},{"productName":"<productName>","uuid":"GPU-<UID>","minorNumber":1,"gpuUtilizations":{"overallGpuUtilization":0.0},"gpuMemoryUsage":{"usedMemoryMiB":73,"availMemoryMiB":12125,"totalMemoryMiB":12198},"temperature":{"currentGpuTemp":28.0,"maxGpuTemp":85.0,"slowThresholdGpuTemp":82.0}}],"driverVersion":"<version>"},"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],"assignedGpuDevices":[{"index":0,"minorNumber":0,"containerId":"container_<containerID>"}]}
> {code}
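
For anyone re-checking this after applying a patch, the quoted response can be inspected for stale assignments with something like the sketch below. It assumes only the JSON shape shown in the report (an `assignedGpuDevices` list whose entries carry `minorNumber` and `containerId`); the helper name `assigned_gpu_containers` is hypothetical and not part of YARN, and the sample body is a trimmed copy of the quoted output with the placeholder IDs left as-is.

```python
import json


def assigned_gpu_containers(response_body: str) -> dict:
    """Map GPU minor number -> container ID for each GPU the NM reports as assigned.

    Hypothetical helper: it relies only on the `assignedGpuDevices` field
    visible in the response quoted in this issue.
    """
    payload = json.loads(response_body)
    return {
        device["minorNumber"]: device["containerId"]
        for device in payload.get("assignedGpuDevices", [])
    }


# Trimmed sample shaped like the quoted NM response; IDs are placeholders.
sample = (
    '{"totalGpuDevices":[{"index":0,"minorNumber":0},{"index":1,"minorNumber":1}],'
    '"assignedGpuDevices":[{"index":0,"minorNumber":0,'
    '"containerId":"container_<containerID>"}]}'
)

still_assigned = assigned_gpu_containers(sample)
for minor, container in still_assigned.items():
    print(f"GPU minor {minor} is still assigned to {container}")
```

If the application has been killed and the GPU was released correctly, the returned mapping would be expected to be empty; a non-empty result after the kill matches the behaviour reported here.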