Re: Debugging framework registration from inside docker
Hi Vinod - this is good news! Just the fact that I'm not barking up the wrong tree and that indeed it is a known issue.

Cheers,
Jim

On 11 June 2015 at 18:16, Vinod Kone <vinodk...@gmail.com> wrote:

> On Thu, Jun 11, 2015 at 4:00 AM, James Vanns <jvanns@gmail.com> wrote:
>
>> I think I can conclude then that this just won't work; one cannot run a
>> framework as a docker container using bridged networking. This is because
>> the POST to the MM that libprocess makes on your framework's behalf
>> includes the non-routable private docker IP, and that is what the MM will
>> then try to communicate with. Setting LIBPROCESS_IP to the host IP will of
>> course not work either, because libprocess (or somewhere in the mesos
>> framework code) then attempts to bind() to that interface, which fails
>> because it does not exist in bridge mode.
>
> You are right on track. This is a known issue:
> https://issues.apache.org/jira/browse/MESOS-809. Anindya has submitted a
> short-term fix, which unfortunately never landed. I'll shepherd and commit
> this.
>
>> *If* the above is correct then the question, I suppose, is why does the
>> communication channel get established in that way? Why off the back of
>> some data in a POST rather than the connected endpoint (which presumably
>> docker would manage/forward as it would for a regular web service, for
>> example)? Is this some caveat of using zookeeper?
>
> Longer term, the plan is for the master to reuse the connection opened by
> the scheduler rather than open a new one, as you mentioned. See
> https://issues.apache.org/jira/browse/MESOS-2289
>
>> I'm sure someone will correct me where I'm wrong ;)
>
> You are not!

--
Senior Code Pig
Industrial Light & Magic
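The longer-term plan Vinod points at (MESOS-2289, reusing the connection the scheduler opened) can be illustrated with a toy sketch. This is hypothetical code, not the Mesos implementation: a "master" that replies on the accepted socket works through NAT/bridging automatically, whereas one that dials the address advertised in the request payload would try to reach the unroutable container IP.

```python
import socket
import threading

def master(server_sock):
    """Toy 'master': reply on the accepted connection itself rather than
    dialling the address the client advertised in its payload -- the
    bridge/NAT translation on the return path is then handled for us."""
    conn, _ = server_sock.accept()
    with conn:
        conn.recv(1024)  # e.g. b'REGISTER 172.17.1.197:44431'
        # The advertised 172.17.x.x address may be unreachable from here;
        # replying via `conn` uses the already-established channel instead.
        conn.sendall(b"REGISTERED")

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # ephemeral port
server.listen(1)
threading.Thread(target=master, args=(server,), daemon=True).start()

# Toy 'scheduler': registers and receives the ack on its own connection.
client = socket.create_connection(server.getsockname())
client.sendall(b"REGISTER 172.17.1.197:44431")
reply = client.recv(1024)
print(reply)  # b'REGISTERED'
client.close()
server.close()
```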
Re: Debugging framework registration from inside docker
For what exactly? I thought that was for slave-master communication? There is no problem there. Or are you suggesting that from inside the running container I set at least LIBPROCESS_IP to the host IP rather than the IP of eth0 that the container sees? Won't that screw with the docker bridge routing?

This doesn't quite make sense. I have other network connections inside this container and those channels are established and communicating fine. It's just with the mesos master for some reason.

Just to be clear:

* The running process is a scheduling framework
* It does not listen for any inbound connection requests
* It does, of course, attempt an outbound connection to the zookeeper to get the MM (this works)
* It then attempts to establish a connection with the MM (this also works)
* When the MM sends a response, it fails - it effectively tries to send the response back to the private/internal docker IP where my scheduler is running
* This problem disappears when run with --net=host

tcpdump never shows any inbound traffic:

IP 172.17.1.197.55182 > 172.20.121.193.5050 ...

Therefore there is never any ACK# that corresponds with the SEQ#, and these are just re-transmissions. I think!

Jim

On 10 June 2015 at 18:16, Steven Schlansker <sschlans...@opentable.com> wrote:

> On Jun 10, 2015, at 10:10 AM, James Vanns <jvanns@gmail.com> wrote:
>
>> Hi. When attempting to run my scheduler inside a docker container in
>> --net=bridge mode it never receives acknowledgement or a reply to that
>> request. However, it works fine in --net=host mode. It does not listen on
>> any port as a service so does not expose any.
>>
>> The scheduler receives the mesos master (leader) from zookeeper fine but
>> fails to register the framework with that master. It just loops trying to
>> do so - the master sees the registration but deactivates it immediately as
>> apparently it disconnects. It doesn't disconnect but is obviously
>> unreachable. I see the reason for this in the sendto() and the master log
>> file -- because the internal docker bridge IP is included in the POST and
>> perhaps that is how the master is trying to talk back to the requesting
>> framework?? Inside the container is this:
>>
>> tcp  0  0  0.0.0.0:44431  0.0.0.0:*  LISTEN  1/scheduler
>>
>> This is not my code! I'm at a loss where to go from here. Anyone got any
>> further suggestions to fix this?
>
> You may need to try setting LIBPROCESS_IP and LIBPROCESS_PORT to hide the
> fact that you are on a virtual Docker interface.

--
Senior Code Pig
Industrial Light & Magic
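The worry about setting LIBPROCESS_IP to the host IP can be reproduced outside Mesos entirely: inside a bridged container the host's IP is not assigned to any local interface, so a bind() to it fails with EADDRNOTAVAIL. A minimal sketch, using the TEST-NET-3 address 203.0.113.1 to stand in for an address the machine does not own:

```python
import errno
import socket

def try_bind(ip, port=0):
    """Attempt to bind a TCP socket to `ip`; return the errno on
    failure, or None on success."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((ip, port))
        return None
    except OSError as e:
        return e.errno
    finally:
        s.close()

# An address assigned to a local interface binds fine...
assert try_bind("127.0.0.1") is None

# ...but an address this machine does not own does not -- which is the
# situation LIBPROCESS_IP=<host IP> creates inside a bridged container.
print(try_bind("203.0.113.1"))  # typically errno.EADDRNOTAVAIL on Linux
```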
Re: Debugging framework registration from inside docker
Looks like I share the same symptoms as this 'marathon inside container' problem:

https://groups.google.com/d/topic/marathon-framework/aFIlv-VnF58/discussion

I guess that sheds some light on the subject ;)

On 11 June 2015 at 09:43, James Vanns <jvanns@gmail.com> wrote:

> For what exactly? I thought that was for slave-master communication?
> There is no problem there. [...]

--
Senior Code Pig
Industrial Light & Magic
Re: Debugging framework registration from inside docker
I think I can conclude then that this just won't work; one cannot run a framework as a docker container using bridged networking. This is because the POST to the MM that libprocess makes on your framework's behalf includes the non-routable private docker IP, and that is what the MM will then try to communicate with. Setting LIBPROCESS_IP to the host IP will of course not work either, because libprocess (or somewhere in the mesos framework code) then attempts to bind() to that interface, which fails because it does not exist in bridge mode.

*If* the above is correct then the question, I suppose, is why does the communication channel get established in that way? Why off the back of some data in a POST rather than the connected endpoint (which presumably docker would manage/forward as it would for a regular web service, for example)? Is this some caveat of using zookeeper?

I'm sure someone will correct me where I'm wrong ;)

Cheers,
Jim

On 11 June 2015 at 10:00, James Vanns <jvanns@gmail.com> wrote:

> Looks like I share the same symptoms as this 'marathon inside container'
> problem:
> https://groups.google.com/d/topic/marathon-framework/aFIlv-VnF58/discussion
> [...]

--
Senior Code Pig
Industrial Light & Magic
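The "non-routable private docker IP" in the conclusion above is easy to check programmatically: Docker's default bridge allocates from 172.17.0.0/16, an RFC 1918 private range that exists only on the container's own host, so a master elsewhere has no route back into it. A small illustration using the addresses from the tcpdump output earlier in the thread:

```python
import ipaddress

# The scheduler's address inside the container vs. the master's address,
# as seen in the tcpdump output earlier in the thread.
container_ip = ipaddress.ip_address("172.17.1.197")
master_ip = ipaddress.ip_address("172.20.121.193")

# Docker's default bridge subnet. Note the master's address is also
# RFC 1918, but it is routed on the site network; the bridge subnet
# below exists only on the scheduler's host.
docker_bridge = ipaddress.ip_network("172.17.0.0/16")

print(container_ip in docker_bridge)  # True
print(container_ip.is_private)        # True
```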
Re: Debugging framework registration from inside docker
On Thu, Jun 11, 2015 at 4:00 AM, James Vanns <jvanns@gmail.com> wrote:

> I think I can conclude then that this just won't work; one cannot run a
> framework as a docker container using bridged networking. This is because
> the POST to the MM that libprocess makes on your framework's behalf
> includes the non-routable private docker IP, and that is what the MM will
> then try to communicate with. Setting LIBPROCESS_IP to the host IP will of
> course not work either, because libprocess (or somewhere in the mesos
> framework code) then attempts to bind() to that interface, which fails
> because it does not exist in bridge mode.

You are right on track. This is a known issue: https://issues.apache.org/jira/browse/MESOS-809. Anindya has submitted a short-term fix, which unfortunately never landed. I'll shepherd and commit this.

> *If* the above is correct then the question, I suppose, is why does the
> communication channel get established in that way? Why off the back of
> some data in a POST rather than the connected endpoint (which presumably
> docker would manage/forward as it would for a regular web service, for
> example)? Is this some caveat of using zookeeper?

Longer term, the plan is for the master to reuse the connection opened by the scheduler rather than open a new one, as you mentioned. See https://issues.apache.org/jira/browse/MESOS-2289

> I'm sure someone will correct me where I'm wrong ;)

You are not!
Re: Debugging framework registration from inside docker
I believe you're correct Jim - if you set LIBPROCESS_IP=$HOST_IP, libprocess will try to bind to that address as well as announce it, which won't work inside a bridged container. We've been having a similar discussion on https://github.com/wickman/pesos/issues/25.

--
Tom Arnfeld
Developer // DueDil

On Thursday, Jun 11, 2015 at 10:00 am, James Vanns <jvanns@gmail.com> wrote:

> Looks like I share the same symptoms as this 'marathon inside container'
> problem:
> https://groups.google.com/d/topic/marathon-framework/aFIlv-VnF58/discussion
> [...]

--
Senior Code Pig
Industrial Light & Magic
Re: Debugging framework registration from inside docker
On Jun 10, 2015, at 10:10 AM, James Vanns <jvanns@gmail.com> wrote:

> Hi. When attempting to run my scheduler inside a docker container in
> --net=bridge mode it never receives acknowledgement or a reply to that
> request. However, it works fine in --net=host mode. It does not listen on
> any port as a service so does not expose any.
>
> The scheduler receives the mesos master (leader) from zookeeper fine but
> fails to register the framework with that master. It just loops trying to
> do so - the master sees the registration but deactivates it immediately as
> apparently it disconnects. It doesn't disconnect but is obviously
> unreachable. I see the reason for this in the sendto() and the master log
> file -- because the internal docker bridge IP is included in the POST and
> perhaps that is how the master is trying to talk back to the requesting
> framework?? Inside the container is this:
>
> tcp  0  0  0.0.0.0:44431  0.0.0.0:*  LISTEN  1/scheduler
>
> This is not my code! I'm at a loss where to go from here. Anyone got any
> further suggestions to fix this?

You may need to try setting LIBPROCESS_IP and LIBPROCESS_PORT to hide the fact that you are on a virtual Docker interface.
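The surprise listener in the netstat output above ("This is not my code!") is libprocess itself: even a scheduler that never serves application traffic gets a listening socket, bound to an ephemeral port chosen by the OS (44431 in this case), which the master uses as the callback endpoint. A hypothetical sketch of how a library obtains such a port:

```python
import socket

# Bind to port 0 and let the OS assign an ephemeral port -- the same way
# a library can open a callback listener without the application ever
# choosing (or knowing about) a port, as libprocess does here.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("0.0.0.0", 0))
listener.listen(5)

host, port = listener.getsockname()
print(f"listening on {host}:{port}")  # an OS-assigned port, e.g. :44431
listener.close()
```

In bridge mode this is exactly the socket that becomes unreachable: it listens happily inside the container, but the advertised address/port pair means nothing to the master on another host.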