[
https://issues.apache.org/jira/browse/YARN-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401818#comment-16401818
]
Shane Kumpf commented on YARN-7973:
-----------------------------------
[~billie.rinaldi] - I looked into the issue you reported. The behavior you see
occurs with or without this patch.
What you see above repeated over and over is the Diagnostics field being
returned during the ContainerStatus calls. Pulling out only the Diagnostics
field from above you get:
{code:java}
Diagnostics: [2018-03-08 22:02:53.397]Exception from container-launch.
Container id: container_1520546307703_0001_01_000002
Exit code: -1
Exception message: <unknown>
Shell output: <unknown>
[2018-03-08 22:02:53.500]Diagnostic message from attempt 0 : [2018-03-08
22:02:53.500]
[2018-03-08 22:02:53.501]Container exited with a non-zero exit code -1.
,{code}
You will see this repeated once per second until the relaunch occurs again (30
seconds by default with native services). Once the relaunch occurs, you will
see the exception that the relaunch failed, as the container isn't in a
startable state. I could be convinced to call launchContainer in this case to
produce the original error if you feel that is most appropriate, but I think
there are alternative improvements to make here:
* The logs are hard to follow with the diagnostics embedded in the log entry
when returning the ContainerStatus. It looks like exceptions are repeated over
and over, as you saw. We should consider moving this to debug logging.
* Populate diagnostics with a better error in this case. The
{{ContainerExecutionException}} thrown as part of this ACL check does not
become part of the Diagnostics field.
* Native Services currently uses {{ContainerRetryPolicy.RETRY_ON_ALL_ERRORS}}
which may be too broad. -1 exit codes should likely be hard fails.
I'll open issues on these if that sounds good?
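To make the third point concrete, here is a rough sketch of how a narrower retry decision could treat -1 as a hard failure. It is only an illustration under stated assumptions, not the Native Services code: {{RetryPolicySketch}}, its nested {{Policy}} enum (which only mirrors the names of the {{ContainerRetryPolicy}} values), {{shouldRetry}}, and the example exit codes 1 and 137 are all hypothetical and defined locally so the snippet compiles without Hadoop on the classpath.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, self-contained sketch; none of these types come from Hadoop.
final class RetryPolicySketch {

    // Local stand-in whose constant names mirror ContainerRetryPolicy.
    enum Policy { NEVER_RETRY, RETRY_ON_ALL_ERRORS, RETRY_ON_SPECIFIC_ERROR_CODES }

    // Decide whether a container that exited with exitCode should be relaunched.
    static boolean shouldRetry(Policy policy, Set<Integer> retryableCodes, int exitCode) {
        if (exitCode == 0) {
            return false; // clean exit, nothing to retry
        }
        switch (policy) {
            case RETRY_ON_ALL_ERRORS:
                return true; // every non-zero exit code is retried, including -1
            case RETRY_ON_SPECIFIC_ERROR_CODES:
                return retryableCodes.contains(exitCode); // only listed codes are retried
            default:
                return false; // NEVER_RETRY
        }
    }

    public static void main(String[] args) {
        // Assumption for illustration: only exit codes 1 and 137 are worth retrying.
        Set<Integer> retryable = new HashSet<>(Arrays.asList(1, 137));

        // Broad policy: the -1 launch failure is retried on every interval.
        System.out.println(shouldRetry(Policy.RETRY_ON_ALL_ERRORS,
                Collections.<Integer>emptySet(), -1));

        // Narrow policy: -1 is not in the set, so it becomes a hard failure.
        System.out.println(shouldRetry(Policy.RETRY_ON_SPECIFIC_ERROR_CODES, retryable, -1));

        // Narrow policy still retries a listed code.
        System.out.println(shouldRetry(Policy.RETRY_ON_SPECIFIC_ERROR_CODES, retryable, 137));
    }
}
```

With the broad policy the -1 exit is retried indefinitely, which matches the repeated relaunch attempts described above; with an explicit set of retryable codes the same exit is treated as a hard failure. Which codes belong in that set is an open question for the follow-up issue.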
> Support ContainerRelaunch for Docker containers
> -----------------------------------------------
>
> Key: YARN-7973
> URL: https://issues.apache.org/jira/browse/YARN-7973
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Shane Kumpf
> Assignee: Shane Kumpf
> Priority: Major
> Attachments: YARN-7973.001.patch, YARN-7973.002.patch
>
>
> Prior to YARN-5366, {{container-executor}} would remove the Docker container
> when it exited. The removal is now handled by the
> {{DockerLinuxContainerRuntime}}. {{ContainerRelaunch}} is intended to reuse
> the workdir from the previous attempt, and does not call {{cleanupContainer}}
> prior to {{launchContainer}}. The container ID is reused as well. As a
> result, the previous Docker container still exists, resulting in an error
> from Docker indicating that a container by that name already exists.