[
https://issues.apache.org/jira/browse/YARN-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472679#comment-16472679
]
Eric Badger commented on YARN-8274:
-----------------------------------
bq. With 3.1.1 code freeze on Saturday, it is easy to make mistakes, and I like
to get YARN-7654 committed before end of today. YARN-7654 and YARN-8207 are
probably left uncommitted for too long
I understand that you want to get these patches into 3.1.1, but I don't believe
we should rush to get features into releases and in the process compromise on
quality. Rushed patches/reviews lead to bugs like this happening at an elevated
rate. I'm also not particularly compelled by the argument that YARN-7654 and
YARN-8207 have been uncommitted for too long. YARN-8207 ended up being a 127 kB
patch of entirely C code, which is incredibly time-consuming to review, while
YARN-7654 is now on patch number 23. It's not as though these aren't getting
reviewed; they are simply going through a normal, comprehensive review process.
I think that YARN-8207 getting committed in 2 weeks is a semi-miracle given the
size, complexity, and possible ramifications of the changes. Reviewing that
much C code (especially in a setuid binary) throughout 10 different patches is
basically a full-time job. [~jlowe] has spent countless more hours/days than I
think should be reasonably expected and is still working in an attempt to get
these patches into 3.1.1. If anything, he should be commended and thanked for
his yeoman’s effort here regardless of whether YARN-7654 makes it into 3.1.1.
So, while I understand that deadlines exist and that we should strive to meet
them, I don't believe that we should rush patches in solely because of a
deadline. That destabilizes the project and causes more work for everyone. If a
patch/feature isn't fully ready, we should step back and get it into the next
release rather than cut review time short and possibly miss something. At the end
of the day, if we are consistently introducing bugs like this, as we have been
recently, then we are clearly iterating too quickly and need to spend more time
reviewing each patch instead of rushing it to commit.
> Docker command error during container relaunch
> ----------------------------------------------
>
> Key: YARN-8274
> URL: https://issues.apache.org/jira/browse/YARN-8274
> Project: Hadoop YARN
> Issue Type: Task
> Reporter: Billie Rinaldi
> Assignee: Jason Lowe
> Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8274.001.patch, YARN-8274.002.patch
>
>
> I initiated container relaunch with a "sleep 60; exit 1" launch command and
> saw a "not a docker command" error on relaunch. Haven't figured out why this
> is happening, but it seems like it has been introduced recently to
> trunk/branch-3.1. cc [[email protected]] [~ebadger]
> {noformat}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Relaunch container failed
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:954)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150)
> at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562)
> at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111)
> at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-05-09 21:41:46,631 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
> 2018-05-09 21:41:46,631 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1525897486447_0003_01_000002
> 2018-05-09 21:41:46,631 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 7
> 2018-05-09 21:41:46,631 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception message: Relaunch container failed
> 2018-05-09 21:41:46,631 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Shell error output: docker: 'container_1525897486447_0003_01_000002' is not a docker command.
> {noformat}