Keqiu Hu created YARN-9294:
------------------------------
Summary: Potential race condition in setting GPU cgroups & execute
command in the selected cgroup
Key: YARN-9294
URL: https://issues.apache.org/jira/browse/YARN-9294
Project: Hadoop YARN
Issue Type: Bug
Components: yarn
Affects Versions: 2.10.0
Reporter: Keqiu Hu
Assignee: Keqiu Hu
Environment: latest branch-2 head
OS: RHEL 7.4
*Observation*
Out of roughly 10 container allocations that request GPUs, at least 1 of the
allocated containers loses GPU isolation. Even when I ask for 1 GPU, I can
still see all GPUs on the same machine when running nvidia-smi.
The odd part is that although all GPUs (say ordinals 0,1,2,3) are visible at
the moment of executing container-executor, cgroups restrict the process's
access to only the single allocated GPU some time later.
The underlying process takes the initial information as the source of truth
and tries to access physical GPU 0, which is not actually available to the
process. This results in a [CUDA_ERROR_INVALID_DEVICE: invalid device ordinal]
error.
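To make the suspected ordering problem concrete, here is a toy model in Python. It is only a sketch under assumptions: DeviceCgroup, GpuProcess and run are hypothetical names, not YARN, container-executor or CUDA APIs, and the excluded list (GPU 3 granted, GPUs 0,1,2 denied) is made up for illustration. It shows how a process that snapshots the GPU list before the restriction lands ends up using a stale ordinal.

```python
# Toy model only: every name here is hypothetical, not a YARN or CUDA API.

class DeviceCgroup:
    """Stands in for a devices cgroup that can deny GPUs by ordinal."""

    def __init__(self, all_gpus):
        self.all_gpus = list(all_gpus)
        self.denied = set()

    def deny(self, gpus):
        # Models the GPU restriction step taking effect.
        self.denied.update(gpus)

    def visible(self):
        return [g for g in self.all_gpus if g not in self.denied]


class GpuProcess:
    """Stands in for a process that enumerates GPUs once and caches the list."""

    def __init__(self, cgroup):
        self.cgroup = cgroup
        self.cached = cgroup.visible()  # snapshot taken at startup

    def open_first_gpu(self):
        gpu = self.cached[0]  # trusts the startup snapshot
        if gpu in self.cgroup.denied:
            raise PermissionError(f"invalid device ordinal: GPU {gpu} is denied")
        return gpu


def run(restrict_first):
    cgroup = DeviceCgroup([0, 1, 2, 3])
    excluded = [0, 1, 2]  # assumption: the container was granted GPU 3 only
    if restrict_first:
        cgroup.deny(excluded)      # isolation is in place before startup
        proc = GpuProcess(cgroup)
    else:
        proc = GpuProcess(cgroup)  # process sees all four GPUs
        cgroup.deny(excluded)      # isolation arrives late
    return proc.open_first_gpu()
```

With restrict_first=True the process opens GPU 3. With restrict_first=False it tries GPU 0 from its stale snapshot and fails, which matches the shape of the error above.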
I validated that the container-executor commands are correct:
{code:java}
PrivilegedOperationExecutor command:
[/export/apps/hadoop/nodemanager/latest/bin/container-executor, --module-gpu,
--container_id, container_e22_1549663278916_0249_01_000001, --excluded_gpus,
0,1,2,3]
PrivilegedOperationExecutor command:
[/export/apps/hadoop/nodemanager/latest/bin/container-executor, khu, khu, 0,
application_1549663278916_0249,
/grid/a/tmp/yarn/nmPrivate/container_e22_1549663278916_0249_01_000001.tokens,
/grid/a/tmp/yarn, /grid/a/tmp/userlogs,
/export/apps/jdk/JDK-1_8_0_172/jre/bin/java, -classpath, ..., -Xmx256m,
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer,
khu, application_1549663278916_0249,
container_e22_1549663278916_0249_01_000001, ltx1-hcl7552.grid.linkedin.com,
8040, /grid/a/tmp/yarn]
{code}
So this is most likely a race condition between these two operations?
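One possible way to check the ordering is to have the launched process log which devices cgroup it belongs to at startup. This is only a diagnostic sketch, not part of YARN: the helper names are hypothetical, and it assumes the cgroup v1 /proc/<pid>/cgroup line format of hierarchy-ID:controller-list:path. It reports membership only, not the deny rules in effect.

```python
def devices_cgroup_path(proc_cgroup_text):
    """Return the cgroup path of the 'devices' controller, or None if absent."""
    for line in proc_cgroup_text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _hierarchy_id, controllers, path = parts
        if "devices" in controllers.split(","):
            return path
    return None


def read_own_devices_cgroup():
    """Read this process's own membership (Linux only)."""
    with open("/proc/self/cgroup") as f:
        return devices_cgroup_path(f.read())
```

If the path logged at startup is not the container's cgroup, the process was started before it was placed there.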
cc [~jhung]