Re: [openstack-dev] [TripleO] [Ironic] [Cinder] Baremetal volumes -- how to model direct attached storage

2014-11-14 Thread Chris Jones
Hi

My thoughts:

Shoe-horning the ephemeral partition into Cinder seems like a lot of pain for 
almost no gain[1]. The only gain I can think of would be that we could bring a 
node down, boot it into a special ramdisk that exposes the volume to the 
network, so cindery operations (e.g. migration) could be performed, but I'm not 
even sure if anyone is asking for that?

Forcing Cinder to understand and track something it can never normally do 
anything with, seems like we're just trying to squeeze ourselves into an 
ever-shrinking VM costume!

Having said that, "preserve ephemeral" is a terrible oxymoron, so if we can do 
something about it, we probably should.

How about instead, we teach Nova/Ironic about a concept of "no ephemeral"? They 
make a partition on the first disk for the first image they deploy, and then 
they never touch the other part(s) of the disk(s), until the instance is 
destroyed. This creates one additional burden for operators, which is to create 
and format a partition the first time they boot, but since this is a very small 
number of commands, and something we could trivially bake into our (root?) 
elements, I'm not sure it's a huge problem.

This gets rid of the cognitive dissonance of preserving something that is 
described as ephemeral, and (IMO) makes it extremely clear that OpenStack isn't 
going to touch anything but the first partition of the first disk. If this were 
baked into the flavour rather than something we tack onto a nova rebuild 
command, it offers greater safety for operators against the risk of 
accidentally wiping a vital state partition with a misconstructed rebuild command.


[1] for local disk, I mean. I still think it'd be nice for operators to be able 
to use a networked Cinder volume for /mnt/state/, but that presents a whole 
different set of challenges :)

Cheers,
--
Chris Jones

 On 13 Nov 2014, at 09:25, Robert Collins robe...@robertcollins.net wrote:
 
 Back in the day before the ephemeral hack (though that was something
 folk have said they would like for libvirt too - so it's not such a
 hack per se) this was (broadly) sketched out. We spoke with the Cinder
 PTL at the time in Portland, from memory.
 
 There was no spec, so here is my brain-dumpy-recollection...
 
 - actual volumes are a poor match because we wouldn't be running
 cinder-volume on an ongoing basis and service records would accumulate
 etc.
 - we'd need cross-service scheduler support to make cinder operations
 line up with allocated bare metal nodes (and to e.g. make sure both
 our data volume and golden image volume are scheduled to the same
 machine).
 
 - folk want to be able to do fairly arbitrary RAID (and JBOD) setups and
 that affects scheduling as well. One way to work it is to have Ironic
 export capabilities and specify actual RAID setups via matching
 flavors - this is the direction the ephemeral work took us, and is
 conceptually straightforwardly extended to RAID. We did talk about
 doing a little JSON schema to describe RAID / volume layouts, which
 cinder could potentially use for user defined volume flavors too.
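 
 No such schema was ever standardized; as a purely hypothetical sketch, a
 layout description of the kind mentioned might look like (all field names
 illustrative):
 
```json
{
  "volumes": [
    {"name": "root",  "raid_level": "1",  "size_gb": 40,
     "disks": ["sda", "sdb"]},
    {"name": "state", "raid_level": "10", "size_gb": "all",
     "disks": ["sdc", "sdd", "sde", "sdf"]}
  ]
}
```
 
 Ironic could match a document like this against the capabilities a node
 exports, and Cinder could in principle reuse the same shape for user-defined
 volume flavors.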
 
 One thing I think is missing from your description relates to this: 
 
 To be clear, in TripleO, we need a way to keep the data on a local
 direct attached storage device while deploying a new image to the box.
 
 I think we need to be able to do this with a single drive shared
 between image and data - doing one disk image, one disk data would add
 substantial waste given the size of disks these days (and for some
 form factors like moonshot it would rule out using them at all).
 
 Of course, being able to do entirely network stored golden images
 might be something some deployments want, but we can't require them
 all to do that ;)
 
 -Rob
 
 
 

Re: [openstack-dev] [TripleO] [Ironic] [Cinder] Baremetal volumes -- how to model direct attached storage

2014-11-14 Thread Clint Byrum
Excerpts from Chris Jones's message of 2014-11-14 00:42:48 -0800:

+1

A predictable and simple rule seems like it would go a long way to
decoupling state preservation from rebuild, which I like very much.

There is, of course, the issue of decom then, but that has never been a
concern for TripleO, and for OnMetal, they think we're a bit daft trying
to preserve state while delivering new images anyway. :)

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [TripleO] [Ironic] [Cinder] Baremetal volumes -- how to model direct attached storage

2014-11-14 Thread Josh Gachnang
For decom (now zapping), I'm building it with config flags to either
disable it entirely, or just disable the erase_disks steps. No comment on
the daft bit :) But I do understand why you'd want to do it this way.

https://review.openstack.org/#/c/102685/



Re: [openstack-dev] [TripleO] [Ironic] [Cinder] Baremetal volumes -- how to model direct attached storage

2014-11-13 Thread Duncan Thomas
The problem with considering it a cinder volume rather than a nova
ephemeral volume is that it is just as leaky a set of semantics -
cinder volumes can be detached, attached elsewhere, snapshotted,
backed up, etc - a directly connected bare metal drive will be able to
do none of these things.

That said, the upcoming cinder-agent code might be of use - it is
designed to provide discovery and an API around local storage - but
mapping bare metal drives as cinder volumes is really no better than
mapping them as nova ephemeral drives - in both cases they don't match
the semantics. I'd rather not bend the cinder semantics out of shape
to clean up the nova ones.






-- 
Duncan Thomas



[openstack-dev] [TripleO] [Ironic] [Cinder] Baremetal volumes -- how to model direct attached storage

2014-11-12 Thread Clint Byrum
Each summit since we created "preserve ephemeral" mode in Nova, I have had
conversations in which at least one person's brain breaks for a
second. There isn't always alcohol involved before; there almost
certainly is always a drink needed after. The very term is vexing, and I
think we have done ourselves a disservice to have it, even if it was the
best option at the time.

To be clear, in TripleO, we need a way to keep the data on a local
direct attached storage device while deploying a new image to the box.
If we were on VMs, we'd attach volumes, and just deploy new VMs and move
the volume over. If we had a SAN, we'd just move the LUNs. But at some
point when you deploy a cloud you're holding data that is expensive to
replicate all at once, and so you'd rather just keep using the same
server instead of trying to move the data.

Since we don't have baremetal Cinder, we had to come up with a way to
do this, so we used Nova rebuild and slipped it a special flag that
said "don't overwrite the partition you'd normally make the 'ephemeral'
partition". This works fine, but it is confusing and limiting. We'd like
something better.
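
Concretely, that mechanism is the rebuild path; assuming the python-novaclient
flag of that era (server and image names below are made up), the operator
workflow looked roughly like:

```shell
# Redeploy a new image to the same node while keeping the contents of the
# "ephemeral" partition. Flag name as shipped in python-novaclient at the
# time; check your client version.
nova rebuild --preserve-ephemeral overcloud-controller-0 new-overcloud-image
```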

I had an interesting discussion with Devananda in which he suggested an
alternative approach. If we were to bring up cinder-volume on our deploy
ramdisks, and configure it in such a way that it claimed ownership of
the section of disk we'd like to preserve, then we could allocate that
storage as a volume. From there, we could boot from volume, or attach
the volume to the instance (which would really just tell us how to find
the volume). When we want to write a new image, we can just delete the old
instance and create a new one, scheduled to wherever that volume already
is. This would require the nova scheduler to have a filter available
where we could select a host by the volumes it has, so we can make sure to
send the instance request back to the box that still has all of the data.
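
That filter doesn't exist in Nova today; as a rough, self-contained sketch of
the idea (HostState, SameVolumeFilter, and the 'requested_volume' key are all
illustrative names, not real Nova scheduler API):

```python
# Hypothetical sketch of "select a host by the volumes it has": pass only
# hosts that already hold the instance's preserved volume.

class HostState:
    """Minimal stand-in for the scheduler's per-host state."""
    def __init__(self, host, volumes):
        self.host = host
        self.volumes = set(volumes)  # IDs of volumes physically on this node

class SameVolumeFilter:
    """Pass only hosts that already hold the requested volume."""
    def host_passes(self, host_state, request_spec):
        wanted = request_spec.get('requested_volume')
        if wanted is None:
            return True  # no volume affinity requested; any host is fine
        return wanted in host_state.volumes

def filter_hosts(hosts, request_spec):
    f = SameVolumeFilter()
    return [h for h in hosts if f.host_passes(h, request_spec)]

hosts = [HostState('node1', ['vol-a']), HostState('node2', ['vol-b'])]
# A rebuild request for the instance whose data lives on vol-b lands on node2.
print([h.host for h in filter_hosts(hosts, {'requested_volume': 'vol-b'})])
```

With such a filter in place, deleting the instance and creating a new one
scheduled "to wherever the volume already is" becomes an ordinary filtered
scheduling pass.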

Alternatively we can keep on using rebuild, but let the volume model the
preservation rather than our special case.

Thoughts? Suggestions? I feel like this might take some time, but it is
necessary to consider it now so we can drive any work we need to get it
done soon.
