[
https://issues.apache.org/jira/browse/YARN-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16769049#comment-16769049
]
Keqiu Hu commented on YARN-9294:
--------------------------------
After more debugging, we found that the race condition is not caused by
flakiness in cgroup creation & launching the job in the cgroup slice, but by an
incompatibility with RHEL 7. Would love to hear if anyone in the community has
experienced the same issue with RHEL 7. Basically, the existing logic of `mkdir
container_123` && `echo taskId > container_123/tasks` doesn't work anymore.
There is some sanity checking in the OS: if the process is not registered in
`/sys/fs/cgroup/systemd/`, the taskId will be removed from
`container_123/tasks`.
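For reference, a minimal sketch of that "mkdir + write pid" pattern is below. The controller mount point, the `hadoop-yarn` hierarchy, and the class/method names are illustrative assumptions, not the actual NodeManager or container-executor code.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class CgroupTaskAssigner {
  // Assumed devices-controller hierarchy; the real path is configurable.
  private static final Path DEVICES_ROOT =
      Paths.get("/sys/fs/cgroup/devices/hadoop-yarn");

  public static void assign(String containerId, long pid) throws IOException {
    // mkdir container_123
    Path containerCgroup = DEVICES_ROOT.resolve(containerId);
    Files.createDirectories(containerCgroup);

    // echo taskId > container_123/tasks
    Files.write(containerCgroup.resolve("tasks"),
        (pid + "\n").getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.WRITE);

    // On RHEL 7, if this pid is not also registered under
    // /sys/fs/cgroup/systemd/, systemd may later remove it from this
    // tasks file, which is the behavior described above.
  }
}
{code}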
There are a couple of ways to fix the issue. One is to use a RHEL 7-specific
CLI such as `systemd-run --unit=hu --slice=hadoop nohup /root/echo.sh` to start
the container executor, but this won't be compatible with other operating
systems. Still trying to figure out if there is a way to make it work for most
OSes.
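A hedged sketch of that workaround, wrapping the launch with `systemd-run` so the process is registered with systemd; the unit/slice names and the launched script are taken from the example command above, and this is not the actual container-executor change:
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SystemdRunLauncher {
  public static Process launch(String unit, String slice, List<String> command)
      throws IOException {
    // systemd-run creates a transient unit under the given slice, so the
    // process shows up under /sys/fs/cgroup/systemd/ and is not cleaned out
    // of the container's cgroup.
    List<String> wrapped = new ArrayList<>(
        Arrays.asList("systemd-run", "--unit=" + unit, "--slice=" + slice));
    wrapped.addAll(command);
    return new ProcessBuilder(wrapped).inheritIO().start();
  }

  public static void main(String[] args) throws IOException {
    // Mirrors: systemd-run --unit=hu --slice=hadoop nohup /root/echo.sh
    launch("hu", "hadoop", Arrays.asList("nohup", "/root/echo.sh"));
  }
}
{code}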
> Potential race condition in setting GPU cgroups & execute command in the
> selected cgroup
> ----------------------------------------------------------------------------------------
>
> Key: YARN-9294
> URL: https://issues.apache.org/jira/browse/YARN-9294
> Project: Hadoop YARN
> Issue Type: Bug
> Components: yarn
> Affects Versions: 2.10.0
> Reporter: Keqiu Hu
> Assignee: Keqiu Hu
> Priority: Critical
>
> Environment is latest branch-2 head
> OS: RHEL 7.4
> *Observation*
> Out of ~10 container allocations with a GPU requirement, at least 1 of the
> allocated containers would lose GPU isolation. Even if I asked for 1 GPU, I
> could still see all GPUs on the same machine when running nvidia-smi.
> The funny thing is that even though I have visibility into all GPUs at the
> moment container-executor runs (say ordinals 0,1,2,3), cgroups jails the
> process's access down to only that single GPU after some time.
> The underlying process trying to access the GPU would take the initial
> information as the source of truth and try to access physical GPU 0, which is
> not actually available to the process. This results in a
> [CUDA_ERROR_INVALID_DEVICE: invalid device ordinal] error.
> Validated that the container-executor commands are correct:
> {code:java}
> PrivilegedOperationExecutor command:
> [/export/apps/hadoop/nodemanager/latest/bin/container-executor, --module-gpu,
> --container_id, container_e22_1549663278916_0249_01_000001, --excluded_gpus,
> 0,1,2,3]
> PrivilegedOperationExecutor command:
> [/export/apps/hadoop/nodemanager/latest/bin/container-executor, khu, khu, 0,
> application_1549663278916_0249,
> /grid/a/tmp/yarn/nmPrivate/container_e22_1549663278916_0249_01_000001.tokens,
> /grid/a/tmp/yarn, /grid/a/tmp/userlogs,
> /export/apps/jdk/JDK-1_8_0_172/jre/bin/java, -classpath, ..., -Xmx256m,
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer,
> khu, application_1549663278916_0249,
> container_e22_1549663278916_0249_01_000001, ltx1-hcl7552.grid.linkedin.com,
> 8040, /grid/a/tmp/yarn]
> {code}
> So most likely a race condition between these two operations?
> cc [~jhung]
> Another potential theory is that the cgroups creation for the container
> actually failed but the error was silently swallowed.