[
https://issues.apache.org/jira/browse/YARN-9099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757354#comment-16757354
]
Szilard Nemeth commented on YARN-9099:
--------------------------------------
Thanks [~sunilg]
> GpuResourceAllocator#getReleasingGpus calculates number of GPUs in a wrong way
> ------------------------------------------------------------------------------
>
> Key: YARN-9099
> URL: https://issues.apache.org/jira/browse/YARN-9099
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Szilard Nemeth
> Assignee: Szilard Nemeth
> Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.3
>
> Attachments: YARN-9099.001.patch, YARN-9099.002.patch
>
>
> getReleasingGpus plays an important role in the calculation that happens
> when GpuAllocator assigns GPUs to a container, see
> GpuResourceAllocator#internalAssignGpus.
> If multiple GPUs are assigned to the same container, getReleasingGpus will
> return an incorrect number.
> The iterator goes over the mappings of (GPU device, container ID) and
> retrieves the container by its ID once for every device that the container ID
> is mapped to.
> Then for every container, the resource value for the GPU resource is added to
> a running sum.
> Obviously, if a container is mapped to 2 or more devices, then the
> container's GPU resource counter is added to the running sum as many times as
> the number of GPU devices the container has.
> Example:
> Let's suppose {{usedDevices}} contains these mappings:
> - (GPU1, container1)
> - (GPU2, container1)
> - (GPU3, container2)
> GPU resource value is 2 for container1 and
> GPU resource value is 1 for container2.
> Then, if container1 is in a running state, getReleasingGpus will return 4
> (2 for the GPU1 mapping plus 2 for the GPU2 mapping) instead of 2.
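
The double counting described above can be reproduced with a small standalone
sketch. This is not the code from the attached patches: the class name, the
method names, the String-based device and container IDs, and the
countedContainers set (standing in for the container-state check) are all
hypothetical and exist only to illustrate the arithmetic of the example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; these are not the real YARN classes.
final class ReleasingGpusSketch {

  // Mirrors the behaviour described in the issue: the GPU resource value
  // is added once per (device, container) mapping.
  static long sumPerMapping(Map<String, String> usedDevices,
      Map<String, Long> gpuResourceValue, Set<String> countedContainers) {
    long sum = 0;
    for (String containerId : usedDevices.values()) {
      if (countedContainers.contains(containerId)) {
        sum += gpuResourceValue.get(containerId);
      }
    }
    return sum;
  }

  // One possible correction (an assumption, not necessarily the committed
  // fix): visit each container ID only once.
  static long sumPerContainer(Map<String, String> usedDevices,
      Map<String, Long> gpuResourceValue, Set<String> countedContainers) {
    long sum = 0;
    Set<String> seen = new HashSet<>();
    for (String containerId : usedDevices.values()) {
      // Set#add returns false if the ID was already seen.
      if (seen.add(containerId) && countedContainers.contains(containerId)) {
        sum += gpuResourceValue.get(containerId);
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    // The mappings from the example in the issue description.
    Map<String, String> usedDevices = new LinkedHashMap<>();
    usedDevices.put("GPU1", "container1");
    usedDevices.put("GPU2", "container1");
    usedDevices.put("GPU3", "container2");

    Map<String, Long> gpuResourceValue = new HashMap<>();
    gpuResourceValue.put("container1", 2L);
    gpuResourceValue.put("container2", 1L);

    // Only container1 is in the state that makes it count.
    Set<String> countedContainers = new HashSet<>();
    countedContainers.add("container1");

    System.out.println(
        sumPerMapping(usedDevices, gpuResourceValue, countedContainers));   // 4
    System.out.println(
        sumPerContainer(usedDevices, gpuResourceValue, countedContainers)); // 2
  }
}
```

Tracking already-seen container IDs keeps the single pass over usedDevices;
iterating over the distinct container IDs directly would work equally well.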