Joseph Wu created MESOS-7858:
--------------------------------

             Summary: Launching a nested container with namespace/pid 
isolation, with glibc < 2.25, may deadlock the LinuxLauncher and 
MesosContainerizer
                 Key: MESOS-7858
                 URL: https://issues.apache.org/jira/browse/MESOS-7858
             Project: Mesos
          Issue Type: Bug
          Components: containerization
    Affects Versions: 1.3.0
            Reporter: Joseph Wu


A bug in glibc (fixed in glibc 2.25) can cause the child process of a 
{{fork}} to fail an {{assert}} spuriously if the parent entered a new pid 
namespace before forking: 
https://sourceware.org/bugzilla/show_bug.cgi?id=15392
https://sourceware.org/bugzilla/show_bug.cgi?id=21386

The LinuxLauncher code happens to do this when launching nested containers:
* The MesosContainerizer process launches a subprocess, passing a customized 
{{ns::clone}} function as an argument.  The calling thread then waits for the 
launch to succeed and return a child PID: 
https://github.com/apache/mesos/blob/1.3.x/src/slave/containerizer/mesos/linux_launcher.cpp#L495
* A separate thread in the Mesos agent forks and then waits for the grandchild 
to report a PID: 
https://github.com/apache/mesos/blob/1.3.x/src/linux/ns.hpp#L453
* The child of the fork first enters the namespaces (including a pid namespace) 
and then forks a grandchild.  The child then calls {{waitpid}} on the 
grandchild: https://github.com/apache/mesos/blob/1.3.x/src/linux/ns.hpp#L555
* Due to the glibc bug, the grandchild sometimes never returns from the 
{{fork}} here: https://github.com/apache/mesos/blob/1.3.x/src/linux/ns.hpp#L540
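
The process shape described in the list above can be sketched as follows. This is a simplified illustration, not the Mesos implementation: the helper name {{launchGrandchild}} is hypothetical, the use of a plain pipe to report the PID is an assumption, and the namespace entry is left out so the sketch runs without privileges.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper, not the Mesos code: forks a child, which forks a
// grandchild. Returns the grandchild PID as reported through a pipe, or -1
// on failure.
inline pid_t launchGrandchild()
{
  int fds[2];
  if (pipe(fds) != 0) {
    return -1;
  }

  pid_t child = fork();
  if (child == -1) {
    close(fds[0]);
    close(fds[1]);
    return -1;
  }

  if (child == 0) {
    close(fds[0]);

    // The real code enters the target namespaces (including the pid
    // namespace) at this point, before the second fork.
    pid_t grandchild = fork();
    if (grandchild == 0) {
      // With the glibc bug, the real grandchild may never get this far.
      _exit(0);
    }

    // Report the grandchild PID (or -1) to the parent, then reap it.
    ssize_t n = write(fds[1], &grandchild, sizeof(grandchild));
    if (grandchild > 0) {
      waitpid(grandchild, nullptr, 0);
    }
    bool ok = grandchild > 0 && n == static_cast<ssize_t>(sizeof(grandchild));
    _exit(ok ? 0 : 1);
  }

  // Parent: block until the child reports the grandchild PID.
  close(fds[1]);
  pid_t reported = -1;
  ssize_t n = read(fds[0], &reported, sizeof(reported));
  close(fds[0]);
  waitpid(child, nullptr, 0);

  return n == static_cast<ssize_t>(sizeof(reported)) ? reported : -1;
}
```

In this shape, a grandchild that never returns from its {{fork}} leaves the child blocked in {{waitpid}}, and anything waiting on the launch result is stuck behind it.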

The glibc bug report suggests a workaround:
{quote}
The obvious solution is just to use clone() after setns() and never use fork() 
- and one can certainly patch both programs to do so. Nevertheless it would be 
nice to see if fork() also worked after setns(), especially since there is no 
inherent reason for it not to.
{quote}
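
A minimal sketch of what "use clone() instead of fork()" could look like, assuming Linux on an architecture where the flags are the first argument of the raw clone system call (such as x86-64). The helper names {{rawCloneChild}} and {{spawnAndReap}} are hypothetical, and the {{setns}} call that would precede this in the launcher is omitted so the sketch runs unprivileged. It invokes the system call directly, so glibc's {{fork}} wrapper and its assertions are not on the child's path.

```cpp
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <csignal>

// Creates a child through the raw clone system call rather than glibc's
// fork(). With a null stack the child continues on a copy of the parent's
// stack, much like fork(). SIGCHLD as the exit signal lets a plain
// waitpid() reap the child.
// Assumption: `flags` is the first clone argument on this architecture.
inline pid_t rawCloneChild()
{
  long result =
    syscall(SYS_clone, static_cast<long>(SIGCHLD), 0L, 0L, 0L, 0L);
  return static_cast<pid_t>(result);
}

// Hypothetical helper: spawns a child that exits with `code` and returns
// the exit status observed by the parent, or -1 on failure.
inline int spawnAndReap(int code)
{
  pid_t pid = rawCloneChild();
  if (pid == -1) {
    return -1;
  }

  if (pid == 0) {
    // Child: keep the work minimal and avoid returning into the caller.
    _exit(code);
  }

  int status = 0;
  if (waitpid(pid, &status, 0) != pid) {
    return -1;
  }

  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling {{spawnAndReap(7)}} should yield 7 in the parent. Whether this is the right shape for the launcher is a separate question; the sketch only shows that a child can be created and reaped without going through {{fork}}.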



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
