Re: [openstack-dev] [Nova] Automatic evacuate
- Original Message -
On 10/21/2014 07:53 PM, David Vossel wrote:
- Original Message -
-Original Message-
From: Russell Bryant [mailto:rbry...@redhat.com]
Sent: October 21, 2014 15:07
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Nova] Automatic evacuate
On 10/21/2014 06:44 AM, Balázs Gibizer wrote:

Hi, sorry for the top posting but it was hard to fit my complete view inline. I'm also thinking about a possible solution for automatic server evacuation. I see two separate sub-problems here: 1) compute node monitoring and fencing, and 2) automatic server evacuation.

Compute node monitoring is currently implemented in the servicegroup module of nova. As far as I understand, pacemaker is the proposed solution in this thread to solve both monitoring and fencing, but we tried and found out that pacemaker_remote on baremetal does not work together with fencing (yet), see [1]. So if we need fencing, then either we have to go for normal pacemaker instead of pacemaker_remote (but that solution doesn't scale), or we configure and call stonith directly when pacemaker detects the compute node failure.

I didn't get the same conclusion from the link you reference. It says: "That is not to say however that fencing of a baremetal node works any differently than that of a normal cluster-node. The Pacemaker policy engine understands how to fence baremetal remote-nodes. As long as a fencing device exists, the cluster is capable of ensuring baremetal nodes are fenced in the exact same way as normal cluster-nodes are fenced." So, it sounds like the core pacemaker cluster can fence the node to me. I CC'd David Vossel, a pacemaker developer, to see if he can help clarify.

It seems there is a contradiction between chapters 1.5 and 7.2 in [1], as 7.2 states: "There are some complications involved with understanding a bare-metal node's state that virtual nodes don't have. Once this logic is complete, pacemaker will be able to integrate bare-metal nodes in the same way virtual remote-nodes currently are. Some special considerations for fencing will need to be addressed." Let's wait for David's statement on this.

Hey, that's me! I can definitely clear all this up.

First off, this document is out of sync with the current state upstream. We're already past Pacemaker v1.1.12 upstream. Section 7.2 of the document being referenced is still talking about future v1.1.11 features. I'll make it simple: if the document references anything that needs to be done in the future, it's already done. Pacemaker remote is feature complete at this point. I've accomplished everything I originally set out to do.

I see one change though. In 7.1 I talk about wanting pacemaker to be able to manage resources in containers. I mention something about libvirt sandbox. I scrapped whatever I was doing there. Pacemaker now has docker support. https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/docker

I've known this document is out of date. It's on my giant list of things to do. Sorry for any confusion.

As far as pacemaker remote and fencing goes, remote-nodes are fenced the exact same way as cluster-nodes. The only consideration that needs to be made is that the cluster-nodes (nodes running the full pacemaker+corosync stack) are the only nodes allowed to initiate fencing. All you have to do is make sure the fencing devices you want to use to fence remote-nodes are accessible to the cluster-nodes. From there you are good to go. Let me know if there's anything else I can clear up.
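As a concrete illustration of the fencing setup described above, here is a minimal sketch. It assumes an IPMI-style fence agent; the host names, addresses and credentials are invented for the example.

  # On a cluster node (full pacemaker+corosync stack). The fence device
  # only has to be reachable from the cluster nodes, not from the
  # remote node itself:
  pcs stonith create fence-compute-1 fence_ipmilan \
      pcmk_host_list="compute-1" ipaddr="10.0.0.101" \
      login="admin" passwd="secret"

  # Integrate the baremetal compute node as a pacemaker_remote node:
  pcs resource create compute-1 ocf:pacemaker:remote \
      server="compute-1.example.com" reconnect_interval=60

With something like that in place, the cluster nodes can fence compute-1 the same way they would fence a full cluster node.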
Pacemaker remote was designed to be the solution for the exact scenario you all are discussing here. Compute nodes and pacemaker remote are made for one another :D If anyone is interested in prototyping pacemaker remote for this compute node use case, make sure to include me. I have done quite a bit research into how to maximize pacemaker's ability to scale horizontally. As part of that research I've made a few changes that are directly related to all of this that are not yet in an official pacemaker release. Come to me for the latest rpms and you'll have a less painful experience setting all this up :) -- Vossel Hi Vossel, Could you send us a link to the source RPMs please, we have tested on CentOS7. It might need a recompile. Yes, centos 7.0 isn't going to have the rpms you need to test this. There are a couple of things you can do. 1. I put the rhel7 related rpms I test with in this repo. http://davidvossel.com/repo/os/el7/ *disclaimer* I only maintain this repo for myself. I'm not committed to keeping it active or up-to-date. It just happens to be updated right now for my own use. That will give you test rpms for the pacemaker version I'm currently using plus the latest libqb. If you're going to do any
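For the second sub-problem from the top of the thread, automatic server evacuation, something external still has to trigger the evacuation once the failed compute node has been fenced. A rough sketch of what such an agent might run with the existing nova client (host names are invented, and flags may differ between client versions):

  # Instances still scheduled to the dead host:
  nova list --all-tenants --host compute-1

  # Rebuild one of them on another host, reusing its disks on shared storage:
  nova evacuate --on-shared-storage <server-id> compute-2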
Re: [openstack-dev] [Nova] Automatic evacuate
- Original Message -
On Thu, Oct 16, 2014 at 7:48 PM, Jay Pipes jaypi...@gmail.com wrote:

While one of us (Jay or me) speaking for the other and saying we agree is a distributed consensus problem that dwarfs the complexity of Paxos... You've always had a way with words, Florian :) I knew you'd like that one. :)

...*I* for my part do think that an external toolset (i.e. one that lives outside the Nova codebase) is the better approach versus duplicating the functionality of said toolset in Nova. I just believe that the toolset that should be used here is Corosync/Pacemaker and not Ceilometer/Heat. And I believe the former approach leads to *much* fewer necessary code changes *in* Nova than the latter.

I agree with you that Corosync/Pacemaker is the tool of choice for monitoring/heartbeat functionality, and is my choice for compute-node-level HA monitoring. For guest-level HA monitoring, I would say use Heat/Ceilometer. For container-level HA monitoring, it looks like fleet or something like Kubernetes would be a good option.

Here's why I think that's a bad idea: none of these support the concept of being subordinate to another cluster. Again, suppose a VM stops responding. Then Heat/Ceilometer/Kubernetes/fleet would need to know whether the node hosting the VM is down or not. Only if the node is up or recovered (which Pacemaker would be responsible for) would the VM HA facility be able to kick in. Effectively you have two views of the cluster membership, and that sort of thing always gets messy. In the HA space we're always facing the same issues when a replication facility (Galera, GlusterFS, DRBD, whatever) has a different view of the cluster membership than the cluster manager itself — which *always* happens for a few seconds on any failover, recovery, or fencing event. Russell's suggestion, by having remote Pacemaker instances on the compute nodes tie in with a Pacemaker cluster on the control nodes, does away with that discrepancy.

I'm curious to see how the combination of compute-node-level HA and container-level HA tools will work together in some of the proposed deployment architectures (bare metal + docker containers w/ OpenStack and infrastructure services run in a Kubernetes pod or CoreOS fleet). I have absolutely nothing against an OpenStack cluster using *exclusively* Kubernetes or fleet for HA management, once those have reached sufficient maturity.

It's not about reaching sufficient maturity for these two projects. They are on the wrong path to achieve proper HA. Kubernetes and fleet (I'll throw geard into the mix as well) do a great job at distributed management of containers. The difference is that instead of integrating with a proper HA stack (like Nova is doing), kubernetes and fleet are attempting their own HA. In doing this, they've unknowingly blown the scope of their respective projects way beyond what they originally set out to do.

Here's the problem. HA is both very misunderstood and deceptively difficult to achieve. System-wide deterministic failover behavior is not a matter of monitoring and restarting failed containers. For kubernetes and fleet to succeed, they will need to integrate with a proper HA stack like pacemaker. Below are some presentation slides on how I envision pacemaker interacting with container orchestration tools. https://github.com/davidvossel/phd/blob/master/doc/presentations/HA_Container_Overview_David_Vossel.pdf?raw=true

-- Vossel

But just about every significant OpenStack distro out there has settled on Corosync/Pacemaker for the time being.
Let's not shove another cluster manager down their throats for little to no real benefit.

Cheers,
Florian
Re: [openstack-dev] [kolla] on Dockerfile patterns
- Original Message -

I'm not arguing that everything should be managed by one systemd, I'm just saying, for certain types of containers, a single docker container with systemd in it might be preferable to trying to slice it unnaturally into several containers. Systemd has invested a lot of time/effort to be able to relaunch failed services, support spawning and maintaining unix sockets and services across them, etc, that you'd have to push out of and across docker containers. All of that can be done, but why reinvent the wheel? Like you said, pacemaker can be made to make it all work, but I have yet to see a way to deploy pacemaker services anywhere near as easy as systemd+yum makes it. (Thanks be to redhat. :) The answer seems to be, it's not dockerish. That's OK. I just wanted to understand the issue for what it is, whether there is a really good reason for not wanting to do it, or whether it's just not the way things are done.

I've had kind of the opposite feeling regarding docker containers. Docker used to do very bad things when killing the container. Nasty if you wanted your database not to go corrupt. Killing pid 1 is a bit sketchy, and then forcing the container down after 10 seconds was particularly bad. Having something like systemd in place allows the database to be notified, then shut down properly. Sure you can script up enough shell to make this work, but you have to do some difficult code, over and over again... Docker has gotten better more recently but it still makes me a bit nervous using it for stateful things.

As for recovery, systemd can do the recovery too. I'd argue at this point in time, I'd expect systemd recovery to probably work better than some custom

Yes, systemd can do recovery, and that is part of the problem. From my perspective there should be one resource management system. Whether that be pacemaker, kubernetes, or some other distributed system, it doesn't matter. If you are mixing systemd with these other external distributed orchestration/management tools, you have containers that are silently failing/recovering without the management layer having any clue. What we want is centralized recovery: there's one tool responsible for detecting and invoking recovery, and everything else in the system is designed to make that possible. If we want to put a process in the container to manage multiple services, we'd need the ability to escalate failures to the distributed management tool. Systemd could work if it was given the ability to act more as a watchdog after starting services than to invoke recovery. If systemd could be configured to die (or potentially gracefully clean up the container's resources before dying) whenever a failure is detected, then systemd might make sense. I'm approaching this from a system management point of view. Running systemd in your one-off container that you're managing manually does not have the same drawbacks. I don't have a vendetta against systemd or anything, I just think it's a step backwards to put systemd in containers. I see little value in having containers become lightweight virtual machines. Containers have much more to offer.

-- Vossel

shell scripts when it comes to doing the right thing recovering at bring up. The other thing is, recovery is not just about pid 1 going away. Often it sticks around and other badness is going on. It's a way to know things are bad, but you can't necessarily rely on it to know the container is healthy. You need more robust checks for that.
Thanks, Kevin From: David Vossel [dvos...@redhat.com] Sent: Tuesday, October 14, 2014 4:52 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [kolla] on Dockerfile patterns - Original Message - Ok, why are you so down on running systemd in a container? It goes against the grain. From a distributed systems view, we gain quite a bit of control by maintaining one service per container. Containers can be re-organised and re-purposed dynamically. If we have systemd trying to manage an entire stack of resources within a container, we lose this control. From my perspective a containerized application stack needs to be managed externally by whatever is orchestrating the containers to begin with. When we take a step back and look at how we actually want to deploy containers, systemd doesn't make much sense. It actually limits us in the long run. Also... recovery. Using systemd to manage a stack of resources within a single container makes it difficult for whatever is externally enforcing the availability of that container to detect the health of the container. As it is now, the actual service is pid 1 of a container. If that service dies, the container dies. If systemd is pid 1, there can be all kinds of chaos occurring within the container, but the external distributed orchestration system won't have a clue
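To make the pid 1 point above concrete: with Docker's exec-form ENTRYPOINT the service itself runs as pid 1, so a crash of the service immediately becomes a dead container that the external manager can see. A minimal sketch (the package and binary path are just an example):

  FROM fedora
  RUN yum -y install httpd
  # Exec form: httpd runs in the foreground as pid 1. If it dies,
  # the container exits and whatever manages the containers notices.
  ENTRYPOINT ["/usr/sbin/httpd", "-DFOREGROUND"]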
Re: [openstack-dev] [kolla] on Dockerfile patterns
- Original Message - On Tue, 2014-10-14 at 19:52 -0400, David Vossel wrote: - Original Message - Ok, why are you so down on running systemd in a container? It goes against the grain. From a distributed systems view, we gain quite a bit of control by maintaining one service per container. Containers can be re-organised and re-purposed dynamically. If we have systemd trying to manage an entire stack of resources within a container, we lose this control. From my perspective a containerized application stack needs to be managed externally by whatever is orchestrating the containers to begin with. When we take a step back and look at how we actually want to deploy containers, systemd doesn't make much sense. It actually limits us in the long run. Also... recovery. Using systemd to manage a stack of resources within a single container makes it difficult for whatever is externally enforcing the availability of that container to detect the health of the container. As it is now, the actual service is pid 1 of a container. If that service dies, the container dies. If systemd is pid 1, there can be all kinds of chaos occurring within the container, but the external distributed orchestration system won't have a clue (unless it invokes some custom health monitoring tools within the container itself, which will likely be the case someday.) I don't really think this is a good argument. If you're using docker, docker is the management and orchestration system for the containers. no, docker is a local tool for pulling images and launching containers. Docker is not the distributed resource manager in charge of overseeing what machines launch what containers and how those containers are linked together. There's no dogmatic answer to the question should you run init in the container. an init daemon might make sense to put in some containers where we have a tightly coupled resource stack. There could be a use case where it would make more sense to put these resources in a single container. I don't think systemd is a good solution for the init daemon though. Systemd attempts to handle recovery itself as if it has the entire view of the system. With containers, the system view exists outside of the containers. If we put an internal init daemon within the containers, that daemon needs to escalate internal failures. The easiest way to do this is to have init die if it encounters a resource failure (init is pid 1, pid 1 exiting causes container to exit, container exiting gets the attention of whatever is managing the containers) The reason for not running init inside a container managed by docker is that you want the template to be thin for ease of orchestration and transfer, so you want to share as much as possible with the host. The more junk you put into the container, the fatter and less agile it becomes, so you should probably share the init system with the host in this paradigm. I don't think the local init system and containers should have anything to do with one another. I said this in a previous reply, I'm approaching this problem from a distributed management perspective. The host's init daemon only has a local view of the world. Conversely, containers can be used to virtualize full operating systems. This isn't the standard way of doing docker, but LXC and OpenVZ by default do containers this way. For this type of container, because you have a full OS running inside the container, you have to also have systemd (assuming it's the init system) running within the container. 
Sure, if you want to do this, use systemd. I don't understand the use case where this makes any sense though. For me this falls in the "yeah you can do it, but why?" category.

-- Vossel

James
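If someone did want a couple of tightly coupled processes in one container, the "init dies on the first failure" escalation described above could be approximated with a tiny wrapper running as pid 1. This is only a sketch, the service names are placeholders, and wait -n needs bash 4.3 or newer:

  #!/bin/bash
  # Start the coupled services in the background.
  nova-compute &
  neutron-openvswitch-agent &

  # wait -n returns as soon as *any* child exits; exit with it so the
  # container dies and the external cluster manager can drive recovery.
  wait -n
  exit 1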
Re: [openstack-dev] [kolla] on Dockerfile patterns
- Original Message - Excerpts from Vishvananda Ishaya's message of 2014-10-15 07:52:34 -0700: On Oct 14, 2014, at 1:12 PM, Clint Byrum cl...@fewbar.com wrote: Excerpts from Lars Kellogg-Stedman's message of 2014-10-14 12:50:48 -0700: On Tue, Oct 14, 2014 at 03:25:56PM -0400, Jay Pipes wrote: I think the above strategy is spot on. Unfortunately, that's not how the Docker ecosystem works. I'm not sure I agree here, but again nobody is forcing you to use this tool. operating system that the image is built for. I see you didn't respond to my point that in your openstack-containers environment, you end up with Debian *and* Fedora images, since you use the official MySQL dockerhub image. And therefore you will end up needing to know sysadmin specifics (such as how network interfaces are set up) on multiple operating system distributions. I missed that part, but ideally you don't *care* about the distribution in use. All you care about is the application. Your container environment (docker itself, or maybe a higher level abstraction) sets up networking for you, and away you go. If you have to perform system administration tasks inside your containers, my general feeling is that something is wrong. Speaking as a curmudgeon ops guy from back in the day.. the reason I choose the OS I do is precisely because it helps me _when something is wrong_. And the best way an OS can help me is to provide excellent debugging tools, and otherwise move out of the way. When something _is_ wrong and I want to attach GDB to mysqld in said container, I could build a new container with debugging tools installed, but that may lose the very system state that I'm debugging. So I need to run things inside the container like apt-get or yum to install GDB.. and at some point you start to realize that having a whole OS is actually a good thing even if it means needing to think about a few more things up front, such as which OS will I use? and what tools do I need installed in my containers? What I mean to say is, just grabbing off the shelf has unstated consequences. If this is how people are going to use and think about containers, I would submit they are a huge waste of time. The performance value they offer is dramatically outweighed by the flexibilty and existing tooling that exists for virtual machines. As I state in my blog post[1] if we really want to get value from containers, we must convert to the single application per container view. This means having standard ways of doing the above either on the host machine or in a debugging container that is as easy (or easier) than the workflow you mention. There are not good ways to do this yet, and the community hand-waves it away, saying things like, well you could …”. You could isn’t good enough. The result is that a lot of people that are using containers today are doing fat containers with a full os. I think we really agree. What the container universe hasn't worked out is all the stuff that the distros have worked out for a long time now: consistency. I agree we need consistency. I have an idea. What if we developed an entrypoint script standard... Something like LSB init scripts except tailored towards the container use case. The primary difference would be that the 'start' action of this new standard wouldn't fork. Instead 'start' would be pid 1. The 'status' could be checked externally by calling the exact same entry point script to invoke the 'status' function. 
This standard would lock us into the 'one service per container' concept while giving us the ability to standardize on how the container is launched and monitored. If we all conformed to something like this, docker could even extend the standard so health checks could be performed using the docker cli tool. docker status container id Internally docker would just be doing a nsenter into the container and calling the internal status function in our init script standard. We already have docker start container and docker stop container. Being able to generically call something like docker status container and have that translate into some service specific command on the inside of the container would be kind of neat. Tools like kubernetes could use this functionality to poll a container's health and be able to detect issues occurring within the container that don't necessarily involve the container's service failing. Does anyone else have any interest in this? I have quite a bit of of init script type standard experience. It would be trivial for me to define something like this for us to begin discussing. -- Vossel I think it would be a good idea for containers' filesystem contents to be a whole distro. What's at question in this thread is what should be running. If we can just chroot into the container's FS
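A rough sketch of what such an entrypoint standard could look like. Nothing like this exists today; the function names simply follow the LSB init-script convention mentioned above, and httpd stands in for whatever single service the container runs:

  #!/bin/sh
  # Hypothetical per-container entrypoint conforming to the proposed standard.

  start() {
      # 'start' must not fork: exec makes the service pid 1.
      exec /usr/sbin/httpd -DFOREGROUND
  }

  status() {
      # Invoked from outside the container (nsenter today, perhaps a
      # 'docker status' subcommand later). Exit 0 means healthy.
      curl -fsS http://localhost/ >/dev/null
  }

  case "$1" in
      start)  start ;;
      status) status ;;
      *)      echo "usage: $0 {start|status}"; exit 2 ;;
  esac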
Re: [openstack-dev] [kolla] on Dockerfile patterns
- Original Message - Same thing works with cloud init too... I've been waiting on systemd working inside a container for a while. it seems to work now. oh no... The idea being its hard to write a shell script to get everything up and running with all the interactions that may need to happen. The init system's already designed for that. Take a nova-compute docker container for example, you probably need nova-compute, libvirt, neutron-openvswitch-agent, and the celiometer-agent all backed in. Writing a shell script to get it all started and shut down properly would be really ugly. You could split it up into 4 containers and try and ensure they are coscheduled and all the pieces are able to talk to each other, but why? Putting them all in one container with systemd starting the subprocesses is much easier and shouldn't have many drawbacks. The components code is designed and tested assuming the pieces are all together. What you need is a dependency model that is enforced outside of the containers. Something that manages the order containers are started/stopped/recovered in. This allows you to isolate your containers with 1 service per container, yet still express that container with service A needs to start before container with service B. Pacemaker does this easily. There's even a docker resource-agent for Pacemaker now. https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/docker -- Vossel ps. don't run systemd in a container... If you think you should, talk to me first. You can even add a ssh server in there easily too and then ansible in to do whatever other stuff you want to do to the container like add other monitoring and such Ansible or puppet or whatever should work better in this arrangement too since existing code assumes you can just systemctl start foo; Kevin From: Lars Kellogg-Stedman [l...@redhat.com] Sent: Tuesday, October 14, 2014 12:10 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [kolla] on Dockerfile patterns On Tue, Oct 14, 2014 at 02:45:30PM -0400, Jay Pipes wrote: With Docker, you are limited to the operating system of whatever the image uses. See, that's the part I disagree with. What I was saying about ansible and puppet in my email is that I think the right thing to do is take advantage of those tools: FROM ubuntu RUN apt-get install ansible COPY my_ansible_config.yaml /my_ansible_config.yaml RUN ansible /my_ansible_config.yaml Or: FROM Fedora RUN yum install ansible COPY my_ansible_config.yaml /my_ansible_config.yaml RUN ansible /my_ansible_config.yaml Put the minimal instructions in your dockerfile to bootstrap your preferred configuration management tool. This is exactly what you would do when booting, say, a Nova instance into an openstack environment: you can provide a shell script to cloud-init that would install whatever packages are required to run your config management tool, and then run that tool. Once you have bootstrapped your cm environment you can take advantage of all those distribution-agnostic cm tools. In other words, using docker is no more limiting than using a vm or bare hardware that has been installed with your distribution of choice. [1] Is there an official MySQL docker image? I found 553 Dockerhub repositories for MySQL images... Yes, it's called mysql. It is in fact one of the official images highlighted on https://registry.hub.docker.com/. I have looked into using Puppet as part of both the build and runtime configuration process, but I haven't spent much time on it yet. 
Oh, I don't think Puppet is any better than Ansible for these things. I think it's pretty clear that I was not suggesting it was better than ansible. That is hardly relevant to this discussion. I was only saying that is what *I* have looked at, and I was agreeing that *any* configuration management system is probably better than writing shells cript. How would I go about essentially transferring the ownership of the RPC exchanges that the original nova-conductor container managed to the new nova-conductor container? Would it be as simple as shutting down the old container and starting up the new nova-conductor container using things like --link rabbitmq:rabbitmq in the startup docker line? I think that you would not necessarily rely on --link for this sort of thing. Under kubernetes, you would use a service definition, in which kubernetes maintains a proxy that directs traffic to the appropriate place as containers are created and destroyed. Outside of kubernetes, you would use some other service discovery mechanism; there are many available (etcd, consul, serf, etc). But this isn't particularly a docker problem. This is the same problem you would face running the same software on top of a cloud
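To illustrate the ordering point made earlier in this message ("container with service A needs to start before container with service B"), here is a sketch using the docker resource agent linked above. The image names and options are invented for the example:

  pcs resource create rabbitmq ocf:heartbeat:docker \
      image=rabbitmq:3 name=rabbitmq

  pcs resource create nova-conductor ocf:heartbeat:docker \
      image=example/nova-conductor name=nova-conductor \
      run_opts="--link rabbitmq:rabbitmq"

  # Start rabbitmq before nova-conductor; stop in the reverse order.
  pcs constraint order start rabbitmq then start nova-conductor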
Re: [openstack-dev] [kolla] on Dockerfile patterns
- Original Message - Ok, why are you so down on running systemd in a container? It goes against the grain. From a distributed systems view, we gain quite a bit of control by maintaining one service per container. Containers can be re-organised and re-purposed dynamically. If we have systemd trying to manage an entire stack of resources within a container, we lose this control. From my perspective a containerized application stack needs to be managed externally by whatever is orchestrating the containers to begin with. When we take a step back and look at how we actually want to deploy containers, systemd doesn't make much sense. It actually limits us in the long run. Also... recovery. Using systemd to manage a stack of resources within a single container makes it difficult for whatever is externally enforcing the availability of that container to detect the health of the container. As it is now, the actual service is pid 1 of a container. If that service dies, the container dies. If systemd is pid 1, there can be all kinds of chaos occurring within the container, but the external distributed orchestration system won't have a clue (unless it invokes some custom health monitoring tools within the container itself, which will likely be the case someday.) -- Vossel Pacemaker works, but its kind of a pain to setup compared just yum installing a few packages and setting init to systemd. There are some benefits for sure, but if you have to force all the docker components onto the same physical machine anyway, why bother with the extra complexity? Thanks, Kevin From: David Vossel [dvos...@redhat.com] Sent: Tuesday, October 14, 2014 3:14 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [kolla] on Dockerfile patterns - Original Message - Same thing works with cloud init too... I've been waiting on systemd working inside a container for a while. it seems to work now. oh no... The idea being its hard to write a shell script to get everything up and running with all the interactions that may need to happen. The init system's already designed for that. Take a nova-compute docker container for example, you probably need nova-compute, libvirt, neutron-openvswitch-agent, and the celiometer-agent all backed in. Writing a shell script to get it all started and shut down properly would be really ugly. You could split it up into 4 containers and try and ensure they are coscheduled and all the pieces are able to talk to each other, but why? Putting them all in one container with systemd starting the subprocesses is much easier and shouldn't have many drawbacks. The components code is designed and tested assuming the pieces are all together. What you need is a dependency model that is enforced outside of the containers. Something that manages the order containers are started/stopped/recovered in. This allows you to isolate your containers with 1 service per container, yet still express that container with service A needs to start before container with service B. Pacemaker does this easily. There's even a docker resource-agent for Pacemaker now. https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/docker -- Vossel ps. don't run systemd in a container... If you think you should, talk to me first. 
You can even add a ssh server in there easily too and then ansible in to do whatever other stuff you want to do to the container like add other monitoring and such Ansible or puppet or whatever should work better in this arrangement too since existing code assumes you can just systemctl start foo; Kevin From: Lars Kellogg-Stedman [l...@redhat.com] Sent: Tuesday, October 14, 2014 12:10 PM To: OpenStack Development Mailing List (not for usage questions) Subject: Re: [openstack-dev] [kolla] on Dockerfile patterns On Tue, Oct 14, 2014 at 02:45:30PM -0400, Jay Pipes wrote: With Docker, you are limited to the operating system of whatever the image uses. See, that's the part I disagree with. What I was saying about ansible and puppet in my email is that I think the right thing to do is take advantage of those tools: FROM ubuntu RUN apt-get install ansible COPY my_ansible_config.yaml /my_ansible_config.yaml RUN ansible /my_ansible_config.yaml Or: FROM Fedora RUN yum install ansible COPY my_ansible_config.yaml /my_ansible_config.yaml RUN ansible /my_ansible_config.yaml Put the minimal instructions in your dockerfile to bootstrap your preferred configuration management tool. This is exactly what you would do when booting, say, a Nova instance into an openstack environment: you can provide a shell script to cloud-init that would install whatever packages are required to run