Hi Jonathan,

Have you opened up a YARN JIRA with your findings? If not, that would be
the next step in debugging the issue and coding up a fix. This certainly
sounds like a bug and something that we should get to the bottom of.

As far as NodeManagers becoming unhealthy, a config could be added to
prevent this. But if you're only seeing one failure out of millions of
tasks, that seems like it would mask more problems than it fixes. One
container failing is bad, but a node going bad and failing every container
that runs on it until it is shut down is much, much worse. However, if you
think you have a use case that would benefit from making the behavior
optional, that is something we could also look into. That would be a
separate YARN JIRA as well.
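
If it helps the discussion, here is a rough sketch of what that opt-out
might look like on the NodeManager side. To be clear, the property name
and the class below are my own assumptions for illustration; they do not
exist in YARN today, and the real shape would be worked out on the JIRA.

import org.apache.hadoop.conf.Configuration;

/**
 * Hypothetical sketch only: decide whether a container-launch failure
 * should mark the whole node unhealthy. The config key is made up.
 */
public class DirFailurePolicy {

  // Assumed new switch; defaulting to true keeps today's behavior.
  static final String MARK_UNHEALTHY_ON_DIR_FAILURE =
      "yarn.nodemanager.linux-container-executor.mark-unhealthy-on-dir-failure";

  private final boolean markUnhealthy;

  public DirFailurePolicy(Configuration conf) {
    this.markUnhealthy = conf.getBoolean(MARK_UNHEALTHY_ON_DIR_FAILURE, true);
  }

  /** exitCode 35 is the "could not create container dirs" case you reported. */
  public boolean shouldMarkNodeUnhealthy(int exitCode) {
    if (exitCode == 35 && !markUnhealthy) {
      return false; // fail only this container, keep the node in service
    }
    return true;    // current behavior, as introduced by YARN-6302
  }
}

Something along those lines would keep the default safe while letting
sites like yours opt out for known-transient failures.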

Thanks,

Eric

On Mon, Sep 17, 2018 at 12:37 PM, Jonathan Bender <
jonben...@stripe.com.invalid> wrote:

> Hello,
>
> We recently started using CGroups with the LinuxContainerExecutor, running
> Apache Hadoop 3.0.0. Occasionally (once out of many millions of tasks) a
> YARN container will fail with a message like the following:
> WARN privileged.PrivilegedOperationExecutor: Shell execution returned
> exit code: 35. Privileged Execution Operation Stderr:
> Could not create container dirsCould not create local files and directories
>
> Looking at the container-executor source, it's traceable to errors here:
> https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L1604
>
> And ultimately to
> https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c#L672
>
> The root failure seems to be in the underlying mkdir call, but that exit
> code / errno is swallowed, so we don't have more details. We tend to see
> this when many containers for the same application start at the same time
> on a host, and we suspect it may be related to race conditions around the
> directories those containers share.
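>
> To make that suspicion concrete, here is a minimal sketch of the pattern
> we think is racing. It is in Java purely for illustration (the real code
> is the native container-executor), and the class and method names are
> made up: if two containers of the same application try to create a shared
> application directory at the same time, a bare mkdir fails for the loser
> even though the directory now exists, so treating "already exists" as
> success would tolerate the race.
>
> import java.io.File;
> import java.io.IOException;
>
> public class SharedAppDirSketch {
>   // Create a directory, tolerating a concurrent creator.
>   static void ensureDir(File dir) throws IOException {
>     if (dir.mkdirs()) {
>       return;                // we created it
>     }
>     if (dir.isDirectory()) {
>       return;                // another container won the race; that's fine
>     }
>     // Only now is it a real failure; include the path so the log says
>     // more than "Could not create container dirs".
>     throw new IOException("Could not create " + dir.getAbsolutePath());
>   }
> }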
>
> Has anyone seen similar failures when using the LinuxContainerExecutor?
>
> This issue is compounded because the LinuxContainerExecutor marks the node
> unhealthy in these scenarios:
> https://github.com/apache/hadoop/blob/release-3.0.0-RC1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java#L566
>
> Under some circumstances this seems appropriate, but since this is a
> transient failure (none of these machines were at capacity for disks,
> inodes, etc.), we shouldn't take the whole NodeManager down. This
> blacklisting behavior was added as part of
> https://issues.apache.org/jira/browse/YARN-6302, which seems perfectly
> valid, but perhaps we should make it configurable so certain users can
> opt out?
>
> Cheers,
> Jon
>
