[ https://issues.apache.org/jira/browse/YARN-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766891#comment-16766891 ]
Keqiu Hu commented on YARN-9294:
--------------------------------
Confirmed it is a race condition between creating the cgroup and executing a
command in that cgroup. We plan to go ahead with a safe check between these two
privileged operations. Note that the same issue should apply to 3.1+ as well.
cc [~wangda] [~tangzhankun]
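For reference, a minimal sketch of what such a safe check could look like,
strictly as an illustration and not the actual patch: the container-executor
path, the /sys/fs/cgroup/devices/hadoop-yarn hierarchy, and all class/method
names below are assumptions.
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Illustrative sketch only; all paths and names are assumptions. */
public final class SafeGpuLaunch {
  private static final String CE =
      "/export/apps/hadoop/nodemanager/latest/bin/container-executor";
  private static final String DEVICES_ROOT = "/sys/fs/cgroup/devices/hadoop-yarn";

  static int run(String... cmd) throws IOException, InterruptedException {
    return new ProcessBuilder(cmd).inheritIO().start().waitFor();
  }

  static void launch(String containerId, String excludedGpus, String... launchCmd)
      throws IOException, InterruptedException {
    // 1) Privileged op: write the GPU deny rules; fail fast on a non-zero
    //    exit code instead of swallowing it.
    if (run(CE, "--module-gpu", "--container_id", containerId,
            "--excluded_gpus", excludedGpus) != 0) {
      throw new IOException("GPU cgroup setup failed for " + containerId);
    }
    // 2) Safe check: the container's devices cgroup must exist before we
    //    launch anything into it. (Existence guards the silent-failure case;
    //    verifying the deny rules themselves would need a privileged read.)
    if (!Files.isDirectory(Paths.get(DEVICES_ROOT, containerId))) {
      throw new IOException("devices cgroup missing for " + containerId);
    }
    // 3) Privileged op: only now launch the container in that cgroup.
    if (run(launchCmd) != 0) {
      throw new IOException("container launch failed for " + containerId);
    }
  }
}
{code}
The idea is simply to fail fast on the GPU module's exit code and to confirm
the container's devices cgroup exists before the launch operation is issued,
closing the window in which a container can start without its deny rules.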
> Potential race condition between setting GPU cgroups & executing a command
> in the selected cgroup
> ----------------------------------------------------------------------------------------
>
> Key: YARN-9294
> URL: https://issues.apache.org/jira/browse/YARN-9294
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 2.10.0
> Reporter: Keqiu Hu
> Assignee: Keqiu Hu
> Priority: Critical
>
> Environment is latest branch-2 head
> OS: RHEL 7.4
> *Observation*
> Out of ~10 container allocations with a GPU requirement, at least 1 of the
> allocated containers would lose GPU isolation. Even if I asked for 1 GPU, I
> could still see all GPUs on the same machine when running nvidia-smi.
> The funny thing is that even though the process has visibility to all GPUs
> at the moment container-executor runs (say ordinals 0,1,2,3), cgroups jails
> its access down to the single allocated GPU after some time.
> The underlying process trying to access the GPU takes that initial
> information as the source of truth and tries to access physical GPU 0, which
> is not actually available to it. This results in a
> [CUDA_ERROR_INVALID_DEVICE: invalid device ordinal] error.
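> As a toy illustration of this failure mode (hypothetical code, no real CUDA
> calls): a process that snapshots its visible devices once at startup keeps
> using a stale ordinal after isolation kicks in:
> {code:java}
> import java.util.List;
>
> /** Toy model of the stale-snapshot failure; not real CUDA code. */
> public final class StaleOrdinal {
>   public static void main(String[] args) {
>     // At startup the process still sees all four GPUs.
>     List<Integer> visibleAtStartup = List.of(0, 1, 2, 3);
>     int chosen = visibleAtStartup.get(0); // caches physical GPU 0
>
>     // Later, cgroups has restricted the process to its single GPU, say 2.
>     List<Integer> visibleNow = List.of(2);
>     if (!visibleNow.contains(chosen)) {
>       // The moral equivalent of CUDA_ERROR_INVALID_DEVICE.
>       throw new IllegalStateException("invalid device ordinal: " + chosen);
>     }
>   }
> }
> {code}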
> Validated that the container-executor commands are correct:
> {code:java}
> PrivilegedOperationExecutor command:
> [/export/apps/hadoop/nodemanager/latest/bin/container-executor, --module-gpu,
> --container_id, container_e22_1549663278916_0249_01_000001, --excluded_gpus,
> 0,1,2,3]
> PrivilegedOperationExecutor command:
> [/export/apps/hadoop/nodemanager/latest/bin/container-executor, khu, khu, 0,
> application_1549663278916_0249,
> /grid/a/tmp/yarn/nmPrivate/container_e22_1549663278916_0249_01_000001.tokens,
> /grid/a/tmp/yarn, /grid/a/tmp/userlogs,
> /export/apps/jdk/JDK-1_8_0_172/jre/bin/java, -classpath, ..., -Xmx256m,
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer,
> khu, application_1549663278916_0249,
> container_e22_1549663278916_0249_01_000001, ltx1-hcl7552.grid.linkedin.com,
> 8040, /grid/a/tmp/yarn]
> {code}
> So this is most likely a race condition between these two operations?
> cc [~jhung]
> Another potential theory is that the cgroups creation for the container
> actually failed, but the error was swallowed silently.
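> To make the second theory concrete, here is a hypothetical sketch (not the
> actual NodeManager code; the devices.deny file and the NVIDIA major number
> 195 are the only grounded details) of how such a failure could be swallowed:
> {code:java}
> import java.io.FileWriter;
> import java.io.IOException;
>
> /** Hypothetical illustration only; the cgroupDir layout is an assumption. */
> public final class SwallowedError {
>   // Anti-pattern: a failed cgroup write is merely logged, so the container
>   // launches with no GPU isolation and nothing upstream ever notices.
>   static void denyGpu(String cgroupDir, int minor) {
>     try (FileWriter w = new FileWriter(cgroupDir + "/devices.deny")) {
>       w.write("c 195:" + minor + " rwm\n"); // NVIDIA devices use major 195
>     } catch (IOException e) {
>       System.err.println("deny failed, continuing: " + e); // swallowed here
>     }
>     // launch proceeds regardless of whether the deny took effect
>   }
> }
> {code}
> Propagating that exception (or checking container-executor's exit code)
> would surface the failure instead of producing an unisolated container.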