Re: [openstack-dev] [TripleO] [Ironic] [Cinder] Baremetal volumes -- how to model direct attached storage
Hi

My thoughts:

Shoe-horning the ephemeral partition into Cinder seems like a lot of pain for almost no gain[1]. The only gain I can think of would be that we could bring a node down, boot it into a special ramdisk that exposes the volume to the network, so cindery operations (e.g. migration) could be performed, but I'm not even sure if anyone is asking for that?

Forcing Cinder to understand and track something it can never normally do anything with, seems like we're just trying to squeeze ourselves into an ever-shrinking VM costume!

Having said that, preserve ephemeral is a terrible oxymoron, so if we can do something about it, we probably should.

How about instead, we teach Nova/Ironic about a concept of no ephemeral? They make a partition on the first disk for the first image they deploy, and then they never touch the other part(s) of the disk(s), until the instance is destroyed. This creates one additional burden for operators, which is to create and format a partition the first time they boot, but since this is a very small number of commands, and something we could trivially bake into our (root?) elements, I'm not sure it's a huge problem.

This gets rid of the cognitive dissonance of preserving something that is described as ephemeral, and (IMO) makes it extremely clear that OpenStack isn't going to touch anything but the first partition of the first disk. If this were baked into the flavour rather than something we tack onto a nova rebuild command, it offers greater safety for operators, against the risk of accidentallying a vital state partition with a misconstructed rebuild command.

[1] for local disk, I mean.
I still think it'd be nice for operators to be able to use a networked Cinder volume for /mnt/state/, but that presents a whole different set of challenges :)

Cheers,
--
Chris Jones

On 13 Nov 2014, at 09:25, Robert Collins robe...@robertcollins.net wrote:

Back in the day before the ephemeral hack (though that was something folk have said they would like for libvirt too - so it's not such a hack per se) this was (broadly) sketched out. We spoke with the cinder PTL at the time in Portland, from memory. There was no spec, so here is my brain-dumpy-recollection...

- actual volumes are a poor match because we wouldn't be running cinder-volume on an ongoing basis and service records would accumulate etc.
- we'd need cross-service scheduler support to make cinder operations line up with allocated bare metal nodes (and to e.g. make sure both our data volume and golden image volume are scheduled to the same machine).
- folk want to be able to do fairly arbitrary RAID (JBOD) setups and that affects scheduling as well. One way to work it is to have Ironic export capabilities and specify actual RAID setups via matching flavors - this is the direction the ephemeral work took us, and is conceptually straightforwardly extended to RAID. We did talk about doing a little JSON schema to describe RAID / volume layouts, which cinder could potentially use for user defined volume flavors too.

One thing I think that is missing from your description is in this:

To be clear, in TripleO, we need a way to keep the data on a local direct attached storage device while deploying a new image to the box.

I think we need to be able to do this with a single drive shared between image and data - doing one disk image, one disk data would add substantial waste given the size of disks these days (and for some form factors like Moonshot it would rule out using them at all).
Of course, being able to do entirely network stored golden images might be something some deployments want, but we can't require them all to do that ;)

-Rob

On 13 November 2014 11:30, Clint Byrum cl...@fewbar.com wrote:

Each summit since we created preserve ephemeral mode in Nova, I have some conversations where at least one person's brain breaks for a second. There isn't always alcohol involved before, there almost certainly is always a drink needed after. The very term is vexing, and I think we have done ourselves a disservice to have it, even if it was the best option at the time.

To be clear, in TripleO, we need a way to keep the data on a local direct attached storage device while deploying a new image to the box. If we were on VMs, we'd attach volumes, and just deploy new VMs and move the volume over. If we had a SAN, we'd just move the LUNs. But at some point when you deploy a cloud you're holding data that is expensive to replicate all at once, and so you'd rather just keep using the same server instead of trying to move the data.

Since we don't have baremetal Cinder, we had to come up with a way to do this, so we used Nova rebuild, and slipped it a special command that said don't overwrite the partition you'd normally make the 'ephemeral' partition. This works fine, but it is confusing and limiting. We'd like something better.

I had an
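Rob's idea of a small JSON document describing RAID / volume layouts, matched against capabilities exported by Ironic, could look roughly like the sketch below. No such schema was ever specified in this thread, so every key name (`raid_level`, `disks`, `volumes`, `preserve`, `usable_gb`) and the `node_satisfies` helper are hypothetical illustrations, not an existing Nova, Ironic or Cinder API.

```python
import json

# Hypothetical layout document. The thread only mentions "a little JSON
# schema", so all key names here are assumptions made for illustration.
LAYOUT_JSON = """
{
  "raid_level": "1",
  "disks": 2,
  "volumes": [
    {"name": "root", "size_gb": 50, "preserve": false},
    {"name": "state", "size_gb": 400, "preserve": true}
  ]
}
"""


def node_satisfies(layout, capabilities):
    """Return True if a node's exported capabilities can host the layout."""
    needed_gb = sum(v["size_gb"] for v in layout["volumes"])
    return (
        layout["raid_level"] in capabilities["raid_levels"]
        and capabilities["disks"] >= layout["disks"]
        and capabilities["usable_gb"] >= needed_gb
    )


layout = json.loads(LAYOUT_JSON)

# Two made-up nodes: one that can host the layout and one that cannot.
big_node = {"raid_levels": ["0", "1", "10"], "disks": 4, "usable_gb": 1000}
small_node = {"raid_levels": ["0"], "disks": 1, "usable_gb": 500}
```

Here the image and the preserved data share the same set of drives, which is the single-drive case Rob says has to be supported.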
Re: [openstack-dev] [TripleO] [Ironic] [Cinder] Baremetal volumes -- how to model direct attached storage
Excerpts from Chris Jones's message of 2014-11-14 00:42:48 -0800:

If this were baked into the flavour rather than something we tack onto a nova rebuild command, it offers greater safety for operators, against the risk of accidentallying a vital state partition with a misconstructed rebuild command.

+1

A predictable and simple rule seems like it would go a long way to decoupling state preservation from rebuild, which I like very much.

There is, of course, the issue of decom then, but that has never been a concern for TripleO, and for OnMetal, they think we're a bit daft trying to preserve state while delivering new images anyway. :)

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
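The "predictable and simple rule" from Chris's proposal (deploys only ever touch the first partition of the first disk, until the instance is destroyed) can be written down in a few lines. This is only a sketch of the rule as stated in the thread; the `Partition` type and the `deploy_may_touch` function are hypothetical names, not part of Nova or Ironic.

```python
from typing import NamedTuple


class Partition(NamedTuple):
    disk: int    # 0 = first disk
    number: int  # 1 = first partition on that disk


def deploy_may_touch(part: Partition, instance_destroyed: bool = False) -> bool:
    """Apply the proposed "no ephemeral" rule to one partition.

    While the instance exists, only the first partition of the first disk
    may be rewritten. Once the instance is destroyed, nothing is protected.
    """
    if instance_destroyed:
        return True
    return part.disk == 0 and part.number == 1


# Example layout: image partition, a state partition, and a second disk.
layout = [Partition(0, 1), Partition(0, 2), Partition(1, 1)]
preserved = [p for p in layout if not deploy_may_touch(p)]
```

Because the rule does not depend on any argument passed at rebuild time, a mistyped rebuild command cannot widen what gets overwritten.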
Re: [openstack-dev] [TripleO] [Ironic] [Cinder] Baremetal volumes -- how to model direct attached storage
For decom (now zapping), I'm building it with config flags to either disable it entirely, or just disable the erase_disks steps. No comment on the daft bit :) But I do understand why you'd want to do it this way.

https://review.openstack.org/#/c/102685/

On Fri Nov 14 2014 at 6:14:13 AM Clint Byrum cl...@fewbar.com wrote:

There is, of course, the issue of decom then, but that has never been a concern for TripleO, and for OnMetal, they think we're a bit daft trying to preserve state while delivering new images anyway. :)
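The two config switches described for zapping (turn it off entirely, or keep it but skip the erase_disks steps) amount to a small filter over the list of steps. The sketch below only illustrates that behaviour: the flag names `zapping_enabled` and `erase_disks_enabled` and the `update_firmware` step are assumptions, and the real option names in the linked review may differ.

```python
# Step names are illustrative. Only "erase_disks" comes from the thread;
# "update_firmware" is a hypothetical second step.
ZAP_STEPS = ["erase_disks", "update_firmware"]


def steps_to_run(zapping_enabled=True, erase_disks_enabled=True):
    """Return the zapping steps that should run for the given flags."""
    if not zapping_enabled:
        # Zapping disabled entirely: run nothing.
        return []
    if erase_disks_enabled:
        return list(ZAP_STEPS)
    # Zapping stays on, but the disk erase is skipped so state survives.
    return [step for step in ZAP_STEPS if step != "erase_disks"]
```

A TripleO-style deployment that wants to keep its state partition would run with the erase step disabled.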
Re: [openstack-dev] [TripleO] [Ironic] [Cinder] Baremetal volumes -- how to model direct attached storage
The problem with considering it a cinder volume rather than a nova ephemeral volume is that it is just as leaky a set of semantics - cinder volumes can be detached, attached elsewhere, snapshotted, backed up, etc - a directly connected bare metal drive will be able to do none of these things.

That said, the upcoming cinder-agent code might be of use - it is designed to provide discovery and an API around local storage - but mapping bare metal drives as cinder volumes is really no better than mapping them as nova ephemeral drives - in both cases they don't match the semantics. I'd rather not bend the cinder semantics out of shape to clean up the nova ones.

On 13 November 2014 00:30, Clint Byrum cl...@fewbar.com wrote:

I had an interesting discussion with Devananda in which he suggested an alternative approach. If we were to bring up cinder-volume on our deploy ramdisks, and configure it in such a way that it claimed ownership of the section of disk we'd like to preserve, then we could allocate that storage as a volume. From there, we could boot from volume, or attach the volume to the instance (which would really just tell us how to find the volume).

--
Duncan Thomas
[openstack-dev] [TripleO] [Ironic] [Cinder] Baremetal volumes -- how to model direct attached storage
Each summit since we created preserve ephemeral mode in Nova, I have some conversations where at least one person's brain breaks for a second. There isn't always alcohol involved before, there almost certainly is always a drink needed after. The very term is vexing, and I think we have done ourselves a disservice to have it, even if it was the best option at the time.

To be clear, in TripleO, we need a way to keep the data on a local direct attached storage device while deploying a new image to the box. If we were on VMs, we'd attach volumes, and just deploy new VMs and move the volume over. If we had a SAN, we'd just move the LUNs. But at some point when you deploy a cloud you're holding data that is expensive to replicate all at once, and so you'd rather just keep using the same server instead of trying to move the data.

Since we don't have baremetal Cinder, we had to come up with a way to do this, so we used Nova rebuild, and slipped it a special command that said don't overwrite the partition you'd normally make the 'ephemeral' partition. This works fine, but it is confusing and limiting. We'd like something better.

I had an interesting discussion with Devananda in which he suggested an alternative approach. If we were to bring up cinder-volume on our deploy ramdisks, and configure it in such a way that it claimed ownership of the section of disk we'd like to preserve, then we could allocate that storage as a volume. From there, we could boot from volume, or attach the volume to the instance (which would really just tell us how to find the volume). When we want to write a new image, we can just delete the old instance and create a new one, scheduled to wherever that volume already is.

This would require the nova scheduler to have a filter available where we could select a host by the volumes it has, so we can make sure to send the instance request back to the box that still has all of the data.
Alternatively we can keep on using rebuild, but let the volume model the preservation rather than our special case.

Thoughts? Suggestions? I feel like this might take some time, but it is necessary to consider it now so we can drive any work we need to get it done soon.
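The scheduler filter described above (pick the host by the volumes it already holds) boils down to a subset check. The sketch below is plain Python under assumed inputs: it does not use Nova's actual filter classes, and the `hosts_holding` function, host names and volume IDs are all hypothetical.

```python
def hosts_holding(volumes_by_host, wanted_volume_ids):
    """Return the hosts that already hold every wanted volume.

    volumes_by_host maps a host name to the IDs of the volumes stored on
    its local disks. A host passes only if it has all the wanted volumes.
    """
    wanted = set(wanted_volume_ids)
    return sorted(
        host for host, volumes in volumes_by_host.items()
        if wanted <= set(volumes)
    )


# Made-up inventory: only node-b still holds the state volume.
inventory = {
    "node-a": ["vol-image-1"],
    "node-b": ["vol-image-2", "vol-state-7"],
    "node-c": [],
}
candidates = hosts_holding(inventory, ["vol-state-7"])
```

With a filter like this, deleting the old instance and creating a new one would send the request back to the box that still has the data.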