Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
On Tue, Jul 05, 2016 at 12:22:33PM +0200, Dmitry Tantsur wrote:
> On 07/04/2016 01:42 PM, Steven Hardy wrote:
> > Hi Dmitry,
> >
> > I wanted to revisit this thread, as I see some of these interfaces are now posted for review, and I have a couple of questions around the naming (specifically for the "provide" action):
> >
> > On Thu, May 19, 2016 at 03:31:36PM +0200, Dmitry Tantsur wrote:
> > > The last step before the deployment is to make nodes "available" using the "provide" provisioning action. Such nodes are exposed to nova, and can be deployed to at any moment. No long-running configuration actions should be run in this state. The "manage" action can be used to bring nodes back to "manageable" state for configuration (e.g. reintrospection).
> >
> > So, I've been reviewing https://review.openstack.org/#/c/334411/ which implements support for "openstack overcloud node provide"
> >
> > I really hate to be the one nitpicking over openstackclient verbiage, but I'm a little unsure whether a literal reading of this verb gives an intuitive understanding of what happens to the nodes as a result of this action. So I wanted to have a broader discussion before we land the code and commit to this interface.
> >
> > Here, I think the problem is that while the dictionary definition of "provide" is "make available for use, supply" (according to google), it implies obtaining the node, not just activating it.
> >
> > So, to me "provide node" implies going and physically getting a node that does not yet exist, but AFAICT what this action actually does is take an existing node and activate it (set it to the "available" state).
> >
> > I'm worried this could be a source of operator confusion - has this discussion already happened in the Ironic community, or is this a TripleO specific term?
>
> Hi, and thanks for the great question.
> As I've already responded on the patch, this term is settled in our OSC plugin spec [1], and we feel like it reflects the reality pretty well. But I clearly understand that naming things is really hard, and what feels obvious to me does not feel obvious to others. Anyway, I'd prefer that we stay consistent with how Ironic names things now.
>
> [1] http://specs.openstack.org/openstack/ironic-specs/specs/approved/ironicclient-osc-plugin.html

Thanks, this is the context I was missing - if the term is already accepted by the ironic community then I agree, let's keep things consistent.

Thanks!

Steve

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
On 07/04/2016 01:42 PM, Steven Hardy wrote:
> Hi Dmitry,
>
> I wanted to revisit this thread, as I see some of these interfaces are now posted for review, and I have a couple of questions around the naming (specifically for the "provide" action):
>
> On Thu, May 19, 2016 at 03:31:36PM +0200, Dmitry Tantsur wrote:
> > The last step before the deployment is to make nodes "available" using the "provide" provisioning action. Such nodes are exposed to nova, and can be deployed to at any moment. No long-running configuration actions should be run in this state. The "manage" action can be used to bring nodes back to "manageable" state for configuration (e.g. reintrospection).
>
> So, I've been reviewing https://review.openstack.org/#/c/334411/ which implements support for "openstack overcloud node provide"
>
> I really hate to be the one nitpicking over openstackclient verbiage, but I'm a little unsure whether a literal reading of this verb gives an intuitive understanding of what happens to the nodes as a result of this action. So I wanted to have a broader discussion before we land the code and commit to this interface.
>
> Here, I think the problem is that while the dictionary definition of "provide" is "make available for use, supply" (according to google), it implies obtaining the node, not just activating it.
>
> So, to me "provide node" implies going and physically getting a node that does not yet exist, but AFAICT what this action actually does is take an existing node and activate it (set it to the "available" state).
>
> I'm worried this could be a source of operator confusion - has this discussion already happened in the Ironic community, or is this a TripleO specific term?

Hi, and thanks for the great question.

As I've already responded on the patch, this term is settled in our OSC plugin spec [1], and we feel like it reflects the reality pretty well. But I clearly understand that naming things is really hard, and what feels obvious to me does not feel obvious to others.
Anyway, I'd prefer that we stay consistent with how Ironic names things now.

[1] http://specs.openstack.org/openstack/ironic-specs/specs/approved/ironicclient-osc-plugin.html

> To me, something like "openstack overcloud node enable" or maybe "node activate" would be more intuitive, as it implies taking an existing node from the inventory and making it active/available in the context of the overcloud deployment?

The problem here is that "provide" does not just "enable" nodes. It also makes nodes pass through cleaning, which may be a pretty complex and long process (we have it disabled for TripleO for this reason).

> Anyway, not a huge issue, but given that this is a new step in our nodes workflow, I wanted to ensure folks are comfortable with the terminology before we commit to it in code.
>
> Thanks!
>
> Steve
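Dmitry's distinction here -- that "provide" is more than an "enable" because it may also trigger automated cleaning -- can be sketched in a few lines. This is a toy model for illustration only; the class and function names are invented and are not ironic's actual API:

```python
class Node:
    """Toy stand-in for an ironic node; not the real object model."""
    def __init__(self, name, state="manageable"):
        self.name = name
        self.state = state
        self.disk_wiped = False

def provide(node, automated_clean=True):
    """Move a node from "manageable" to "available".

    Unlike a plain "enable", this may first run automated cleaning
    (e.g. wiping the hard drive), which can be a long process -- the
    reason TripleO ships with cleaning disabled.
    """
    if node.state != "manageable":
        raise ValueError('"provide" is only valid from "manageable"')
    if automated_clean:
        node.disk_wiped = True  # placeholder for the real cleaning steps
    node.state = "available"
    return node

node = provide(Node("compute-0"))
print(node.state)  # available
```

With cleaning enabled, the node only reaches "available" after the wipe; with `automated_clean=False` it is exposed to nova immediately, leftover data and all -- which is exactly the trade-off discussed later in this thread.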
Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
On Mon, Jul 4, 2016 at 5:12 PM, Steven Hardy wrote:
> Hi Dmitry,
>
> I wanted to revisit this thread, as I see some of these interfaces are now posted for review, and I have a couple of questions around the naming (specifically for the "provide" action):
>
> On Thu, May 19, 2016 at 03:31:36PM +0200, Dmitry Tantsur wrote:
> > The last step before the deployment is to make nodes "available" using the "provide" provisioning action. Such nodes are exposed to nova, and can be deployed to at any moment. No long-running configuration actions should be run in this state. The "manage" action can be used to bring nodes back to "manageable" state for configuration (e.g. reintrospection).
>
> So, I've been reviewing https://review.openstack.org/#/c/334411/ which implements support for "openstack overcloud node provide"
>
> I really hate to be the one nitpicking over openstackclient verbiage, but I'm a little unsure whether a literal reading of this verb gives an intuitive understanding of what happens to the nodes as a result of this action. So I wanted to have a broader discussion before we land the code and commit to this interface.
>
> More info below:
>
> > what do you propose?
> >
> > I would like the new TripleO mistral workflows to start following the ironic state machine closer. Imagine the following workflows:
> >
> > 1. register: take JSON, create nodes in "manageable" state. I do believe we can automate the enroll->manageable transition, as it serves the purpose of validation (and discovery, but let's put it aside).
> >
> > 2. provide: take a list of nodes or all "manageable" nodes and move them to "available". By using this workflow an operator will make a *conscious* decision to add some nodes to the cloud.
> Here, I think the problem is that while the dictionary definition of "provide" is "make available for use, supply" (according to google), it implies obtaining the node, not just activating it.
>
> So, to me "provide node" implies going and physically getting a node that does not yet exist, but AFAICT what this action actually does is take an existing node and activate it (set it to the "available" state).
>
> I'm worried this could be a source of operator confusion - has this discussion already happened in the Ironic community, or is this a TripleO specific term?
>
> To me, something like "openstack overcloud node enable" or maybe "node activate" would be more intuitive, as it implies taking an existing node from the inventory and making it active/available in the context of the overcloud deployment?

My 2 cents: as an operator, the part where a node is "enrolled", "manageable" or "available" is a bit confusing to first timers. It would be simpler if we had something like: all baremetal nodes (nodes in the "enroll" or "manageable" states) and all cluster nodes (nodes in the "available" or deployed states). I do not know if there is a "deployed" state :)

regards
/sanjay

> Anyway, not a huge issue, but given that this is a new step in our nodes workflow, I wanted to ensure folks are comfortable with the terminology before we commit to it in code.
>
> Thanks!
>
> Steve
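Sanjay's simpler operator-facing grouping could look like this as a sketch. The helper is hypothetical (no such API exists); "active" is ironic's name for the deployed state, and the two-group split is just his suggestion:

```python
def operator_view(provision_state):
    """Collapse ironic's provision states into the two groups suggested above."""
    if provision_state in ("enroll", "manageable"):
        return "baremetal node"   # registered, but not yet handed to the cloud
    if provision_state in ("available", "active"):
        return "cluster node"     # schedulable by nova, or already deployed
    return "other"                # cleaning, error states, etc.

print(operator_view("manageable"))  # baremetal node
print(operator_view("active"))      # cluster node
```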
Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
Hi Dmitry,

I wanted to revisit this thread, as I see some of these interfaces are now posted for review, and I have a couple of questions around the naming (specifically for the "provide" action):

On Thu, May 19, 2016 at 03:31:36PM +0200, Dmitry Tantsur wrote:
> The last step before the deployment is to make nodes "available" using the "provide" provisioning action. Such nodes are exposed to nova, and can be deployed to at any moment. No long-running configuration actions should be run in this state. The "manage" action can be used to bring nodes back to "manageable" state for configuration (e.g. reintrospection).

So, I've been reviewing https://review.openstack.org/#/c/334411/ which implements support for "openstack overcloud node provide"

I really hate to be the one nitpicking over openstackclient verbiage, but I'm a little unsure whether a literal reading of this verb gives an intuitive understanding of what happens to the nodes as a result of this action. So I wanted to have a broader discussion before we land the code and commit to this interface.

More info below:

> what do you propose?
>
> I would like the new TripleO mistral workflows to start following the ironic state machine closer. Imagine the following workflows:
>
> 1. register: take JSON, create nodes in "manageable" state. I do believe we can automate the enroll->manageable transition, as it serves the purpose of validation (and discovery, but let's put it aside).
>
> 2. provide: take a list of nodes or all "manageable" nodes and move them to "available". By using this workflow an operator will make a *conscious* decision to add some nodes to the cloud.

Here, I think the problem is that while the dictionary definition of "provide" is "make available for use, supply" (according to google), it implies obtaining the node, not just activating it.
So, to me "provide node" implies going and physically getting a node that does not yet exist, but AFAICT what this action actually does is take an existing node and activate it (set it to the "available" state).

I'm worried this could be a source of operator confusion - has this discussion already happened in the Ironic community, or is this a TripleO specific term?

To me, something like "openstack overcloud node enable" or maybe "node activate" would be more intuitive, as it implies taking an existing node from the inventory and making it active/available in the context of the overcloud deployment?

Anyway, not a huge issue, but given that this is a new step in our nodes workflow, I wanted to ensure folks are comfortable with the terminology before we commit to it in code.

Thanks!

Steve
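For reference, the verb-to-transition mapping under discussion looks roughly like this. It is a much-simplified sketch of the ironic state machine (the real one has many more states and transitions), with the table itself assembled from the states named in this thread:

```python
# (current state, provisioning verb) -> next state, per the thread.
TRANSITIONS = {
    ("enroll", "manage"): "manageable",
    ("manageable", "provide"): "available",  # the verb being debated here
    ("available", "manage"): "manageable",   # e.g. to reintrospect a node
}

def apply_verb(state, verb):
    """Apply a provisioning verb, refusing transitions ironic would refuse."""
    try:
        return TRANSITIONS[(state, verb)]
    except KeyError:
        raise ValueError(f"{verb!r} is not valid in state {state!r}") from None

print(apply_verb("manageable", "provide"))  # available
```

Note that "provide" starts from a node ironic already knows about -- nothing is "obtained" -- which is exactly the objection to the verb raised above.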
Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
On 05/21/2016 08:35 PM, Dan Prince wrote:

On Fri, 2016-05-20 at 14:06 +0200, Dmitry Tantsur wrote:

On 05/20/2016 01:44 PM, Dan Prince wrote:

On Thu, 2016-05-19 at 15:31 +0200, Dmitry Tantsur wrote:

Hi all!

We started some discussions on https://review.openstack.org/#/c/300200/ about the future of node management (registering, configuring and introspecting) in the new API, but I think it's more fair (and convenient) to move it here. The goal is to fix several long-standing design flaws that affect the logic behind tripleoclient. So fasten your seatbelts, here it goes.

If you already understand why we need to change this logic, just scroll down to the "what do you propose?" section.

"introspection bulk start" is evil
----------------------------------

As many of you obviously know, TripleO used the following command for introspection:

openstack baremetal introspection bulk start

As not everyone knows though, this command does not come from the ironic-inspector project, it's part of TripleO itself. And the ironic team has some big problems with it.

The way it works is:

1. Take all nodes in "available" state and move them to "manageable" state
2. Execute introspection for all nodes in "manageable" state
3. Move all nodes with successful introspection to "available" state.

Step 3 is pretty controversial, step 1 is just horrible. This is not how the ironic-inspector team designed introspection to work (hence it refuses to run on nodes in "available" state), and that's not how the ironic team expects the ironic state machine to be handled. To explain it I'll provide brief information on the ironic state machine.

ironic node lifecycle
---------------------

With recent versions of the bare metal API (starting with 1.11), nodes begin their life in a state called "enroll". Nodes in this state are not available for deployment, nor for most other actions. Ironic does not touch such nodes in any way.

To make nodes alive an operator uses the "manage" provisioning action to move nodes to "manageable" state.
During this transition the power and management credentials (IPMI, SSH, etc) are validated to ensure that nodes in "manageable" state are, well, manageable. This state is still not available for deployment. With nodes in this state an operator can execute various pre-deployment actions, such as introspection, RAID configuration, etc. So to sum it up, nodes in "manageable" state are being configured before exposing them into the cloud.

The last step before the deployment is to make nodes "available" using the "provide" provisioning action. Such nodes are exposed to nova, and can be deployed to at any moment. No long-running configuration actions should be run in this state. The "manage" action can be used to bring nodes back to "manageable" state for configuration (e.g. reintrospection).

so what's the problem?
----------------------

The problem is that TripleO essentially bypasses this logic by keeping all nodes "available" and walking them through provisioning steps automatically. Just a couple of examples of what gets broken:

(1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for deployment (including potential autoscaling) and I want to enroll 10 more nodes. Both introspection and ready-state operations nowadays will touch both the 10 new nodes AND the 10 nodes which are ready for deployment, potentially making the latter not ready for deployment any more (and definitely moving them out of the pool for some time). Particularly, any manual configuration made by an operator before making nodes "available" may get destroyed.

(2) TripleO has to disable automated cleaning. Automated cleaning is a set of steps (currently only wiping the hard drive) that happens in ironic 1) before nodes are available, 2) after an instance is deleted. As TripleO CLI constantly moves nodes back and forth from and to "available" state, cleaning kicks in every time. Unless it's disabled.

Disabling cleaning might sound like a sufficient workaround, until you need it. And you actually do.
Here is a real life example of how to get yourself broken by not having cleaning:

a. Deploy an overcloud instance
b. Delete it
c. Deploy an overcloud instance on a different hard drive
d. Boom.

This sounds like an Ironic bug to me. Cleaning (wiping a disk) and removing state that would break subsequent installations on a different drive are different things. In TripleO I think the reason we disable cleaning is largely because of the extra time it takes and the fact that our baremetal cloud isn't multi-tenant (currently at least).

We fix this "bug" by introducing cleaning. This is the process to guarantee each deployment starts with a clean environment. It's hard to know which remaining data can cause which problem (e.g. what about a remaining UEFI partition? any remnants of Ceph? I don't know).

As we didn't pass cleaning, there is still a config drive on the disk used in the first deployment. With 2 config drives present cloud-init will pick a random one, breaking the deployment.
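The "bulk start" behaviour Dmitry objects to above can be sketched as follows -- a toy model with invented names, not the real tripleoclient code. The point is in step 1: it sweeps up *every* "available" node, including ones an operator deliberately kept ready for deployment:

```python
def introspection_bulk_start(nodes):
    """Roughly what "openstack baremetal introspection bulk start" did."""
    # Step 1: move *all* available nodes to manageable -- even ones that
    # were meant to stay ready for deployment or autoscaling.
    targets = [n for n in nodes if n["state"] == "available"]
    for n in targets:
        n["state"] = "manageable"
    # Step 2: introspect everything now manageable (details elided).
    # Step 3: move the successfully introspected nodes back to available.
    for n in targets:
        n["state"] = "available"
    return targets

fleet = [{"name": "new-0", "state": "enroll"},
         {"name": "ready-0", "state": "available"},
         {"name": "ready-1", "state": "available"}]
touched = introspection_bulk_start(fleet)
print([n["name"] for n in touched])  # ['ready-0', 'ready-1'] -- the ready nodes!
```

The freshly enrolled node is not introspected at all, while the two nodes that were ready for deployment are yanked out of the pool -- the exact failure mode described in example (1).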
Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
On Fri, 2016-05-20 at 14:06 +0200, Dmitry Tantsur wrote:
> On 05/20/2016 01:44 PM, Dan Prince wrote:
> > On Thu, 2016-05-19 at 15:31 +0200, Dmitry Tantsur wrote:
> > > Hi all!
> > >
> > > We started some discussions on https://review.openstack.org/#/c/300200/ about the future of node management (registering, configuring and introspecting) in the new API, but I think it's more fair (and convenient) to move it here. The goal is to fix several long-standing design flaws that affect the logic behind tripleoclient. So fasten your seatbelts, here it goes.
> > >
> > > If you already understand why we need to change this logic, just scroll down to the "what do you propose?" section.
> > >
> > > "introspection bulk start" is evil
> > > ----------------------------------
> > >
> > > As many of you obviously know, TripleO used the following command for introspection:
> > >
> > > openstack baremetal introspection bulk start
> > >
> > > As not everyone knows though, this command does not come from the ironic-inspector project, it's part of TripleO itself. And the ironic team has some big problems with it.
> > >
> > > The way it works is:
> > >
> > > 1. Take all nodes in "available" state and move them to "manageable" state
> > > 2. Execute introspection for all nodes in "manageable" state
> > > 3. Move all nodes with successful introspection to "available" state.
> > >
> > > Step 3 is pretty controversial, step 1 is just horrible. This is not how the ironic-inspector team designed introspection to work (hence it refuses to run on nodes in "available" state), and that's not how the ironic team expects the ironic state machine to be handled. To explain it I'll provide brief information on the ironic state machine.
> > > ironic node lifecycle
> > > ---------------------
> > >
> > > With recent versions of the bare metal API (starting with 1.11), nodes begin their life in a state called "enroll". Nodes in this state are not available for deployment, nor for most other actions. Ironic does not touch such nodes in any way.
> > >
> > > To make nodes alive an operator uses the "manage" provisioning action to move nodes to "manageable" state. During this transition the power and management credentials (IPMI, SSH, etc) are validated to ensure that nodes in "manageable" state are, well, manageable. This state is still not available for deployment. With nodes in this state an operator can execute various pre-deployment actions, such as introspection, RAID configuration, etc. So to sum it up, nodes in "manageable" state are being configured before exposing them into the cloud.
> > >
> > > The last step before the deployment is to make nodes "available" using the "provide" provisioning action. Such nodes are exposed to nova, and can be deployed to at any moment. No long-running configuration actions should be run in this state. The "manage" action can be used to bring nodes back to "manageable" state for configuration (e.g. reintrospection).
> > >
> > > so what's the problem?
> > > ----------------------
> > >
> > > The problem is that TripleO essentially bypasses this logic by keeping all nodes "available" and walking them through provisioning steps automatically. Just a couple of examples of what gets broken:
> > >
> > > (1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for deployment (including potential autoscaling) and I want to enroll 10 more nodes.
> > > Both introspection and ready-state operations nowadays will touch both 10 new nodes AND 10 nodes which are ready for deployment, potentially making the latter not ready for deployment any more (and definitely moving them out of the pool for some time).
> > >
> > > Particularly, any manual configuration made by an operator before making nodes "available" may get destroyed.
> > >
> > > (2) TripleO has to disable automated cleaning. Automated cleaning is a set of steps (currently only wiping the hard drive) that happens in ironic 1) before nodes are available, 2) after an instance is deleted. As TripleO CLI constantly moves nodes back-and-forth from and to "available" state, cleaning kicks in every time. Unless it's disabled.
> > >
> > > Disabling cleaning might sound like a sufficient workaround, until you need it. And you actually do. Here is a real life example of how to get yourself broken by not having cleaning:
> > >
> > > a. Deploy an overcloud instance
> > > b. Delete it
> > > c. Deploy an overcloud instance on a different hard drive
> > > d. Boom.
Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
On 05/20/2016 03:42 PM, John Trowbridge wrote:

On 05/19/2016 09:31 AM, Dmitry Tantsur wrote:

Hi all!

We started some discussions on https://review.openstack.org/#/c/300200/ about the future of node management (registering, configuring and introspecting) in the new API, but I think it's more fair (and convenient) to move it here. The goal is to fix several long-standing design flaws that affect the logic behind tripleoclient. So fasten your seatbelts, here it goes.

If you already understand why we need to change this logic, just scroll down to the "what do you propose?" section.

"introspection bulk start" is evil
----------------------------------

As many of you obviously know, TripleO used the following command for introspection:

openstack baremetal introspection bulk start

As not everyone knows though, this command does not come from the ironic-inspector project, it's part of TripleO itself. And the ironic team has some big problems with it.

The way it works is:

1. Take all nodes in "available" state and move them to "manageable" state
2. Execute introspection for all nodes in "manageable" state
3. Move all nodes with successful introspection to "available" state.

Step 3 is pretty controversial, step 1 is just horrible. This is not how the ironic-inspector team designed introspection to work (hence it refuses to run on nodes in "available" state), and that's not how the ironic team expects the ironic state machine to be handled. To explain it I'll provide brief information on the ironic state machine.

ironic node lifecycle
---------------------

With recent versions of the bare metal API (starting with 1.11), nodes begin their life in a state called "enroll". Nodes in this state are not available for deployment, nor for most other actions. Ironic does not touch such nodes in any way.

To make nodes alive an operator uses the "manage" provisioning action to move nodes to "manageable" state.
During this transition the power and management credentials (IPMI, SSH, etc) are validated to ensure that nodes in "manageable" state are, well, manageable. This state is still not available for deployment. With nodes in this state an operator can execute various pre-deployment actions, such as introspection, RAID configuration, etc. So to sum it up, nodes in "manageable" state are being configured before exposing them into the cloud.

The last step before the deployment is to make nodes "available" using the "provide" provisioning action. Such nodes are exposed to nova, and can be deployed to at any moment. No long-running configuration actions should be run in this state. The "manage" action can be used to bring nodes back to "manageable" state for configuration (e.g. reintrospection).

so what's the problem?
----------------------

The problem is that TripleO essentially bypasses this logic by keeping all nodes "available" and walking them through provisioning steps automatically. Just a couple of examples of what gets broken:

(1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for deployment (including potential autoscaling) and I want to enroll 10 more nodes. Both introspection and ready-state operations nowadays will touch both the 10 new nodes AND the 10 nodes which are ready for deployment, potentially making the latter not ready for deployment any more (and definitely moving them out of the pool for some time). Particularly, any manual configuration made by an operator before making nodes "available" may get destroyed.

(2) TripleO has to disable automated cleaning. Automated cleaning is a set of steps (currently only wiping the hard drive) that happens in ironic 1) before nodes are available, 2) after an instance is deleted. As TripleO CLI constantly moves nodes back and forth from and to "available" state, cleaning kicks in every time. Unless it's disabled.

Disabling cleaning might sound like a sufficient workaround, until you need it. And you actually do.
Here is a real life example of how to get yourself broken by not having cleaning:

a. Deploy an overcloud instance
b. Delete it
c. Deploy an overcloud instance on a different hard drive
d. Boom.

As we didn't pass cleaning, there is still a config drive on the disk used in the first deployment. With 2 config drives present cloud-init will pick a random one, breaking the deployment. To top it all, TripleO users tend to not use root device hints, so switching root disks may happen randomly between deployments. Have fun debugging.

what do you propose?
--------------------

I would like the new TripleO mistral workflows to start following the ironic state machine closer. Imagine the following workflows:

1. register: take JSON, create nodes in "manageable" state. I do believe we can automate the enroll->manageable transition, as it serves the purpose of validation (and discovery, but let's put it aside).

2. provide: take a list of nodes or all "manageable" nodes and move them to "available". By using this workflow an operator will make a *conscious* decision to add some nodes to the cloud.
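The register/provide split proposed above can be sketched like so. These are hypothetical helpers for illustration, not the actual mistral workflows:

```python
def register(node_json):
    """Proposed "register": create nodes directly in "manageable",
    automating the enroll -> manageable (validation) transition."""
    return [{"name": d["name"], "state": "manageable"} for d in node_json]

def provide(nodes, names=None):
    """Proposed "provide": act on the named nodes, or on *all* manageable
    ones -- an explicit, conscious opt-in to the cloud."""
    targets = [n for n in nodes if n["state"] == "manageable"
               and (names is None or n["name"] in names)]
    for n in targets:
        n["state"] = "available"   # now exposed to nova
    return targets

fleet = register([{"name": "node-0"}, {"name": "node-1"}])
provide(fleet, names=["node-0"])      # only node-0 joins the cloud
print([n["state"] for n in fleet])    # ['available', 'manageable']
```

The key difference from the old bulk command is direction: nodes start out of the cloud and are added deliberately, instead of starting "available" and being yanked in and out automatically.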
Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
On 05/19/2016 09:31 AM, Dmitry Tantsur wrote:
> Hi all!
>
> We started some discussions on https://review.openstack.org/#/c/300200/ about the future of node management (registering, configuring and introspecting) in the new API, but I think it's more fair (and convenient) to move it here. The goal is to fix several long-standing design flaws that affect the logic behind tripleoclient. So fasten your seatbelts, here it goes.
>
> If you already understand why we need to change this logic, just scroll down to the "what do you propose?" section.
>
> "introspection bulk start" is evil
> ----------------------------------
>
> As many of you obviously know, TripleO used the following command for introspection:
>
> openstack baremetal introspection bulk start
>
> As not everyone knows though, this command does not come from the ironic-inspector project, it's part of TripleO itself. And the ironic team has some big problems with it.
>
> The way it works is:
>
> 1. Take all nodes in "available" state and move them to "manageable" state
> 2. Execute introspection for all nodes in "manageable" state
> 3. Move all nodes with successful introspection to "available" state.
>
> Step 3 is pretty controversial, step 1 is just horrible. This is not how the ironic-inspector team designed introspection to work (hence it refuses to run on nodes in "available" state), and that's not how the ironic team expects the ironic state machine to be handled. To explain it I'll provide brief information on the ironic state machine.
>
> ironic node lifecycle
> ---------------------
>
> With recent versions of the bare metal API (starting with 1.11), nodes begin their life in a state called "enroll". Nodes in this state are not available for deployment, nor for most other actions. Ironic does not touch such nodes in any way.
>
> To make nodes alive an operator uses the "manage" provisioning action to move nodes to "manageable" state.
> During this transition the power and management credentials (IPMI, SSH, etc) are validated to ensure that nodes in "manageable" state are, well, manageable. This state is still not available for deployment. With nodes in this state an operator can execute various pre-deployment actions, such as introspection, RAID configuration, etc. So to sum it up, nodes in "manageable" state are being configured before exposing them into the cloud.
>
> The last step before the deployment is to make nodes "available" using the "provide" provisioning action. Such nodes are exposed to nova, and can be deployed to at any moment. No long-running configuration actions should be run in this state. The "manage" action can be used to bring nodes back to "manageable" state for configuration (e.g. reintrospection).
>
> so what's the problem?
> ----------------------
>
> The problem is that TripleO essentially bypasses this logic by keeping all nodes "available" and walking them through provisioning steps automatically. Just a couple of examples of what gets broken:
>
> (1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for deployment (including potential autoscaling) and I want to enroll 10 more nodes.
>
> Both introspection and ready-state operations nowadays will touch both 10 new nodes AND 10 nodes which are ready for deployment, potentially making the latter not ready for deployment any more (and definitely moving them out of the pool for some time).
>
> Particularly, any manual configuration made by an operator before making nodes "available" may get destroyed.
>
> (2) TripleO has to disable automated cleaning. Automated cleaning is a set of steps (currently only wiping the hard drive) that happens in ironic 1) before nodes are available, 2) after an instance is deleted. As TripleO CLI constantly moves nodes back-and-forth from and to "available" state, cleaning kicks in every time. Unless it's disabled.
> Disabling cleaning might sound like a sufficient workaround, until you need it. And you actually do. Here is a real life example of how to get yourself broken by not having cleaning:
>
> a. Deploy an overcloud instance
> b. Delete it
> c. Deploy an overcloud instance on a different hard drive
> d. Boom.
>
> As we didn't pass cleaning, there is still a config drive on the disk used in the first deployment. With 2 config drives present cloud-init will pick a random one, breaking the deployment.
>
> To top it all, TripleO users tend to not use root device hints, so switching root disks may happen randomly between deployments. Have fun debugging.
>
> what do you propose?
> --------------------
>
> I would like the new TripleO mistral workflows to start following the ironic state machine closer. Imagine the following workflows:
>
> 1. register: take JSON, create nodes in "manageable" state. I do believe we can automate the enroll->manageable transition, as it serves the purpose of validation (and discovery, but let's put it aside).
Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
On 05/20/2016 02:54 PM, Steven Hardy wrote:
> Hi Dmitry,
>
> Thanks for the detailed write-up, some comments below:
>
> On Thu, May 19, 2016 at 03:31:36PM +0200, Dmitry Tantsur wrote:
> > what do you propose?
> >
> > I would like the new TripleO mistral workflows to start following the ironic state machine closer. Imagine the following workflows:
> >
> > 1. register: take JSON, create nodes in "manageable" state. I do believe we can automate the enroll->manageable transition, as it serves the purpose of validation (and discovery, but lets put it aside).
> >
> > 2. provide: take a list of nodes or all "manageable" nodes and move them to "available". By using this workflow an operator will make a *conscious* decision to add some nodes to the cloud.
> >
> > 3. introspect: take a list of "manageable" (!!!) nodes or all "manageable" nodes and move them through introspection. This is an optional step between "register" and "provide".
> >
> > 4. set_node_state: a helper workflow to move nodes between states. The "provide" workflow is essentially set_node_state with verb=provide, but is separate due to its high importance in the node lifecycle.
> >
> > 5. configure: given a couple of parameters (deploy image, local boot flag, etc), update given or all "manageable" nodes with them.
> >
> > Essentially the only addition here is the "provide" action which I hope you already realize should be an explicit step.
> >
> > what about tripleoclient
> >
> > Of course we want to keep backward compatibility. The existing commands
> >
> > openstack baremetal import
> > openstack baremetal configure boot
> > openstack baremetal introspection bulk start
> >
> > will use some combinations of workflows above and will be deprecated.
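The selection semantics proposed for "provide" and "set_node_state" above can be sketched as follows. Function names and data shapes here are assumptions for illustration only, not the actual tripleo-common Mistral workflow API:

```python
# Illustrative sketch of the proposed "provide" workflow semantics:
# act on an explicit list of nodes, or on all "manageable" nodes when
# no list is given -- never on nodes in other states.

def set_node_state(nodes, node_uuids, target_state):
    """Helper 'workflow': apply a provisioning transition to given nodes."""
    for uuid in node_uuids:
        nodes[uuid] = target_state
    return node_uuids

def provide(nodes, node_uuids=None):
    """Move listed nodes, or all 'manageable' nodes, to 'available'.

    Selecting only 'manageable' nodes means nodes that are already
    'available' (possibly in use by deployments) are never touched --
    making nodes deployable stays a conscious operator decision.
    """
    if node_uuids is None:
        node_uuids = [u for u, state in nodes.items() if state == "manageable"]
    return set_node_state(nodes, node_uuids, target_state="available")

inventory = {"node-1": "manageable", "node-2": "manageable", "node-3": "available"}
provided = provide(inventory)  # touches node-1 and node-2; node-3 is left alone
```

Contrast this with the old "bulk" behaviour, which would sweep up every "available" node as well.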
> > The new commands (also avoiding hijacking into the bare metal namespaces) will be provided strictly matching the workflows (especially in terms of the state machine):
> >
> > openstack overcloud node import
> > openstack overcloud node configure
> > openstack overcloud node introspect
> > openstack overcloud node provide
>
> So, provided we maintain backwards compatibility this sounds OK, but one question - is there any alternative approach that might solve this problem more generally, e.g. not only for TripleO?

I was thinking about that. We could move the import command to ironicclient, but it won't support the TripleO format and additions then. It's still a good thing to have, I'll talk about it upstream.

As to introspect and provide, the only thing which is different from the ironic analogs is that ironic commands don't act on "all nodes in XXX state", and I don't think we ever will.

> Given that we're likely to implement these workflows in mistral, it probably does make sense to switch to a TripleO-specific namespace, but I can't help wondering if we're solving a general problem in a TripleO-specific way - e.g. isn't this something any user adding nodes from an inventory, introspecting them and finally making them available for deployment is going to need?
>
> Also, and it may be too late to fix this, "openstack overcloud node" is kinda strange, because we're importing nodes on the undercloud, which could in theory be used for any purpose, not only overcloud deployments.

I agree, but keeping our stuff in ironic's namespace leads to even more confusion and even potential conflicts (e.g. we can't introduce "baremetal import", because tripleo reserved it).

> We've already done arguably the wrong thing with e.g. openstack overcloud image upload (which, actually, uploads images to the undercloud), but I wanted to point out that we're maintaining that inconsistency with your proposed interface (which may be the least-bad option I suppose).
Thanks, Steve

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
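The backward-compatibility plan discussed in this exchange boils down to a small mapping from deprecated tripleoclient commands to the new workflow-backed ones (command names as proposed in this thread; exact arguments were still under review at the time):

```python
# Proposed mapping from deprecated tripleoclient commands to the new
# workflow-backed commands, as discussed in this thread.
DEPRECATED_TO_NEW = {
    "openstack baremetal import": "openstack overcloud node import",
    "openstack baremetal configure boot": "openstack overcloud node configure",
    # The bulk-start command is split: introspection no longer implies
    # making nodes available; "provide" becomes a separate, conscious step.
    "openstack baremetal introspection bulk start":
        "openstack overcloud node introspect",
}

# New command with no deprecated equivalent:
NEW_ONLY = ["openstack overcloud node provide"]
```

The key behavioural difference is in the last entry: the old bulk command silently returned nodes to "available", while the new interface requires an explicit "provide".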
Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
Hi Dmitry,

Thanks for the detailed write-up, some comments below:

On Thu, May 19, 2016 at 03:31:36PM +0200, Dmitry Tantsur wrote:
> what do you propose?
>
> I would like the new TripleO mistral workflows to start following the ironic state machine closer. Imagine the following workflows:
>
> 1. register: take JSON, create nodes in "manageable" state. I do believe we can automate the enroll->manageable transition, as it serves the purpose of validation (and discovery, but lets put it aside).
>
> 2. provide: take a list of nodes or all "manageable" nodes and move them to "available". By using this workflow an operator will make a *conscious* decision to add some nodes to the cloud.
>
> 3. introspect: take a list of "manageable" (!!!) nodes or all "manageable" nodes and move them through introspection. This is an optional step between "register" and "provide".
>
> 4. set_node_state: a helper workflow to move nodes between states. The "provide" workflow is essentially set_node_state with verb=provide, but is separate due to its high importance in the node lifecycle.
>
> 5. configure: given a couple of parameters (deploy image, local boot flag, etc), update given or all "manageable" nodes with them.
>
> Essentially the only addition here is the "provide" action which I hope you already realize should be an explicit step.
>
> what about tripleoclient
>
> Of course we want to keep backward compatibility. The existing commands
>
> openstack baremetal import
> openstack baremetal configure boot
> openstack baremetal introspection bulk start
>
> will use some combinations of workflows above and will be deprecated.
> The new commands (also avoiding hijacking into the bare metal namespaces) will be provided strictly matching the workflows (especially in terms of the state machine):
>
> openstack overcloud node import
> openstack overcloud node configure
> openstack overcloud node introspect
> openstack overcloud node provide

So, provided we maintain backwards compatibility this sounds OK, but one question - is there any alternative approach that might solve this problem more generally, e.g. not only for TripleO?

Given that we're likely to implement these workflows in mistral, it probably does make sense to switch to a TripleO-specific namespace, but I can't help wondering if we're solving a general problem in a TripleO-specific way - e.g. isn't this something any user adding nodes from an inventory, introspecting them and finally making them available for deployment is going to need?

Also, and it may be too late to fix this, "openstack overcloud node" is kinda strange, because we're importing nodes on the undercloud, which could in theory be used for any purpose, not only overcloud deployments.

We've already done arguably the wrong thing with e.g. openstack overcloud image upload (which, actually, uploads images to the undercloud), but I wanted to point out that we're maintaining that inconsistency with your proposed interface (which may be the least-bad option I suppose).

Thanks, Steve
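The "configure" workflow from the proposal quoted above would follow the same selection pattern as "provide" and "introspect" — act on listed nodes or all "manageable" ones. A sketch under the same assumptions (illustrative names and field layout, not the real tripleo-common code):

```python
# Illustrative sketch of the proposed "configure" workflow: set the
# deploy image and boot options on given nodes, or on all "manageable"
# nodes. Field names mimic ironic's driver_info/capabilities but are
# simplified for illustration.

def configure(nodes, deploy_kernel, deploy_ramdisk, node_uuids=None,
              local_boot=False):
    if node_uuids is None:
        node_uuids = [u for u, n in nodes.items()
                      if n["provision_state"] == "manageable"]
    for uuid in node_uuids:
        node = nodes[uuid]
        # In real ironic these settings live on the node object itself.
        node["driver_info"] = {"deploy_kernel": deploy_kernel,
                               "deploy_ramdisk": deploy_ramdisk}
        node["capabilities"] = {"boot_option": "local" if local_boot else "netboot"}
    return node_uuids

inventory = {
    "node-1": {"provision_state": "manageable"},
    "node-2": {"provision_state": "available"},  # must not be touched
}
configure(inventory, "deploy-kernel-uuid", "deploy-ramdisk-uuid", local_boot=True)
```

Restricting configuration to "manageable" nodes is exactly what prevents the problem from the original post: already-available nodes, possibly holding operator tweaks, are never overwritten.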
Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
On 05/20/2016 01:44 PM, Dan Prince wrote:
> On Thu, 2016-05-19 at 15:31 +0200, Dmitry Tantsur wrote:
> > Hi all!
> >
> > We started some discussions on https://review.openstack.org/#/c/300200/ about the future of node management (registering, configuring and introspecting) in the new API, but I think it's more fair (and convenient) to move it here. The goal is to fix several long-standing design flaws that affect the logic behind tripleoclient. So fasten your seatbelts, here it goes.
> >
> > If you already understand why we need to change this logic, just scroll down to the "what do you propose?" section.
> >
> > "introspection bulk start" is evil
> > --
> >
> > As many of you obviously know, TripleO used the following command for introspection:
> >
> > openstack baremetal introspection bulk start
> >
> > As not everyone knows though, this command does not come from the ironic-inspector project, it's part of TripleO itself. And the ironic team has some big problems with it.
> >
> > The way it works is
> >
> > 1. Take all nodes in "available" state and move them to "manageable" state
> > 2. Execute introspection for all nodes in "manageable" state
> > 3. Move all nodes with successful introspection to "available" state.
> >
> > Step 3 is pretty controversial, step 1 is just horrible. This is not how the ironic-inspector team designed introspection to work (hence it refuses to run on nodes in "available" state), and that's not how the ironic team expects the ironic state machine to be handled. To explain it I'll provide some brief information on the ironic state machine.
> >
> > ironic node lifecycle
> > -
> >
> > With recent versions of the bare metal API (starting with 1.11), nodes begin their life in a state called "enroll". Nodes in this state are not available for deployment, nor for most other actions. Ironic does not touch such nodes in any way.
> >
> > To make nodes alive an operator uses the "manage" provisioning action to move nodes to "manageable" state.
> > During this transition the power and management credentials (IPMI, SSH, etc.) are validated to ensure that nodes in "manageable" state are, well, manageable. This state is still not available for deployment. With nodes in this state an operator can execute various pre-deployment actions, such as introspection, RAID configuration, etc. So to sum it up, nodes in "manageable" state are being configured before exposing them into the cloud.
> >
> > The last step before the deployment is to make nodes "available" using the "provide" provisioning action. Such nodes are exposed to nova, and can be deployed to at any moment. No long-running configuration actions should be run in this state. The "manage" action can be used to bring nodes back to "manageable" state for configuration (e.g. reintrospection).
> >
> > so what's the problem?
> > --
> >
> > The problem is that TripleO essentially bypasses this logic by keeping all nodes "available" and walking them through provisioning steps automatically. Just a couple of examples of what gets broken:
> >
> > (1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for deployment (including potential autoscaling) and I want to enroll 10 more nodes.
> >
> > Both introspection and ready-state operations nowadays will touch both the 10 new nodes AND the 10 nodes which are ready for deployment, potentially making the latter not ready for deployment any more (and definitely moving them out of the pool for some time).
> >
> > In particular, any manual configuration made by an operator before making nodes "available" may get destroyed.
> >
> > (2) TripleO has to disable automated cleaning. Automated cleaning is a set of steps (currently only wiping the hard drive) that happens in ironic 1) before nodes are available, 2) after an instance is deleted. As the TripleO CLI constantly moves nodes back and forth from and to the "available" state, cleaning kicks in every time. Unless it's disabled.
> >
> > Disabling cleaning might sound like a sufficient workaround, until you need it. And you actually do.
> > Here is a real life example of how to get yourself broken by not having cleaning:
> >
> > a. Deploy an overcloud instance
> > b. Delete it
> > c. Deploy an overcloud instance on a different hard drive
> > d. Boom.
>
> This sounds like an Ironic bug to me. Cleaning (wiping a disk) and removing state that would break subsequent installations on a different drive are different things. In TripleO I think the reason we disable cleaning is largely because of the extra time it takes and the fact that our baremetal cloud isn't multi-tenant (currently at least).

We fix this "bug" by introducing cleaning. This is the process to guarantee each deployment starts with a clean environment. It's hard to know which remaining data can cause which problem (e.g. what about a leftover UEFI partition? any remnants of Ceph? I don't know).

> > As we didn't pass cleaning, there is still a config drive on the disk used in the first deployment. With 2 config drives present cloud-init will pick a random one, breaking the deployment.
>
> TripleO isn't using config drive is it? Until Nova
Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
Hi,

> This sounds like an Ironic bug to me. Cleaning (wiping a disk) and removing state that would break subsequent installations on a different drive are different things. In TripleO I think the reason we disable cleaning is largely because of the extra time it takes and the fact that our baremetal cloud isn't multi-tenant (currently at least).

It's a complicated issue; there are ways in Ironic to make sure the image will always be deployed onto a specific hard drive [0]. But when it's not specified, Ironic will pick the first disk that appears, and in Linux, at least for SATA, SCSI or IDE disk controllers, the order in which the devices are added is arbitrary, e.g. /dev/sda and /dev/sdb could swap around between reboots.

>> As we didn't pass cleaning, there is still a config drive on the disk used in the first deployment. With 2 config drives present cloud-init will pick a random one, breaking the deployment.
>
> TripleO isn't using config drive is it? Until Nova supports config drives via Ironic I think we are blocked on using it.

It's already supported, and has been for two or more cycles [1]. The difference with VMs is that with baremetal the config drive lives on the disk as a partition, while for VMs it's presented as an external device.

[0] http://docs.openstack.org/developer/ironic/deploy/install-guide.html#specifying-the-disk-for-deployment
[1] http://docs.openstack.org/developer/ironic/deploy/install-guide.html#enabling-the-configuration-drive-configdrive

Hope that helps,
Lucas
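Lucas's point about root device hints [0] can be made concrete: a hint is a small JSON document stored in the node's properties/root_device field, and ironic then only considers disks matching it. The hint value below is made up for illustration, and the CLI form is only roughly what the ironic client of this era accepted:

```python
import json

# A root device hint, per the ironic install guide [0]: a JSON object
# stored under the node's properties/root_device field. Supported keys
# include e.g. "serial", "wwn", "size"; the value here is made up.
root_device_hint = {"wwn": "0x4000cca77fc4dba1"}

# Roughly the equivalent ironic CLI invocation (node UUID left as a
# placeholder):
cmd = ("ironic node-update <node-uuid> add "
       "properties/root_device='%s'" % json.dumps(root_device_hint))
print(cmd)
```

With such a hint in place, the "switching root disks may happen randomly between deployments" failure mode from the original post cannot occur — the deploy either lands on the matching disk or fails loudly.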
Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
On Thu, 2016-05-19 at 15:31 +0200, Dmitry Tantsur wrote:
> Hi all!
>
> We started some discussions on https://review.openstack.org/#/c/300200/ about the future of node management (registering, configuring and introspecting) in the new API, but I think it's more fair (and convenient) to move it here. The goal is to fix several long-standing design flaws that affect the logic behind tripleoclient. So fasten your seatbelts, here it goes.
>
> If you already understand why we need to change this logic, just scroll down to the "what do you propose?" section.
>
> "introspection bulk start" is evil
> --
>
> As many of you obviously know, TripleO used the following command for introspection:
>
> openstack baremetal introspection bulk start
>
> As not everyone knows though, this command does not come from the ironic-inspector project, it's part of TripleO itself. And the ironic team has some big problems with it.
>
> The way it works is
>
> 1. Take all nodes in "available" state and move them to "manageable" state
> 2. Execute introspection for all nodes in "manageable" state
> 3. Move all nodes with successful introspection to "available" state.
>
> Step 3 is pretty controversial, step 1 is just horrible. This is not how the ironic-inspector team designed introspection to work (hence it refuses to run on nodes in "available" state), and that's not how the ironic team expects the ironic state machine to be handled. To explain it I'll provide some brief information on the ironic state machine.
>
> ironic node lifecycle
> -
>
> With recent versions of the bare metal API (starting with 1.11), nodes begin their life in a state called "enroll". Nodes in this state are not available for deployment, nor for most other actions. Ironic does not touch such nodes in any way.
>
> To make nodes alive an operator uses the "manage" provisioning action to move nodes to "manageable" state.
> During this transition the power and management credentials (IPMI, SSH, etc.) are validated to ensure that nodes in "manageable" state are, well, manageable. This state is still not available for deployment. With nodes in this state an operator can execute various pre-deployment actions, such as introspection, RAID configuration, etc. So to sum it up, nodes in "manageable" state are being configured before exposing them into the cloud.
>
> The last step before the deployment is to make nodes "available" using the "provide" provisioning action. Such nodes are exposed to nova, and can be deployed to at any moment. No long-running configuration actions should be run in this state. The "manage" action can be used to bring nodes back to "manageable" state for configuration (e.g. reintrospection).
>
> so what's the problem?
> --
>
> The problem is that TripleO essentially bypasses this logic by keeping all nodes "available" and walking them through provisioning steps automatically. Just a couple of examples of what gets broken:
>
> (1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for deployment (including potential autoscaling) and I want to enroll 10 more nodes.
>
> Both introspection and ready-state operations nowadays will touch both the 10 new nodes AND the 10 nodes which are ready for deployment, potentially making the latter not ready for deployment any more (and definitely moving them out of the pool for some time).
>
> In particular, any manual configuration made by an operator before making nodes "available" may get destroyed.
>
> (2) TripleO has to disable automated cleaning. Automated cleaning is a set of steps (currently only wiping the hard drive) that happens in ironic 1) before nodes are available, 2) after an instance is deleted. As the TripleO CLI constantly moves nodes back and forth from and to the "available" state, cleaning kicks in every time. Unless it's disabled.
> Disabling cleaning might sound like a sufficient workaround, until you need it. And you actually do. Here is a real life example of how to get yourself broken by not having cleaning:
>
> a. Deploy an overcloud instance
> b. Delete it
> c. Deploy an overcloud instance on a different hard drive
> d. Boom.

This sounds like an Ironic bug to me. Cleaning (wiping a disk) and removing state that would break subsequent installations on a different drive are different things. In TripleO I think the reason we disable cleaning is largely because of the extra time it takes and the fact that our baremetal cloud isn't multi-tenant (currently at least).

> As we didn't pass cleaning, there is still a config drive on the disk used in the first deployment. With 2 config drives present cloud-init will pick a random one, breaking the deployment.

TripleO isn't using config drive is it? Until Nova supports config drives via Iron
[openstack-dev] [tripleo] Nodes management in our shiny new TripleO API
Hi all!

We started some discussions on https://review.openstack.org/#/c/300200/ about the future of node management (registering, configuring and introspecting) in the new API, but I think it's more fair (and convenient) to move it here. The goal is to fix several long-standing design flaws that affect the logic behind tripleoclient. So fasten your seatbelts, here it goes.

If you already understand why we need to change this logic, just scroll down to the "what do you propose?" section.

"introspection bulk start" is evil
--

As many of you obviously know, TripleO used the following command for introspection:

openstack baremetal introspection bulk start

As not everyone knows though, this command does not come from the ironic-inspector project, it's part of TripleO itself. And the ironic team has some big problems with it.

The way it works is

1. Take all nodes in "available" state and move them to "manageable" state
2. Execute introspection for all nodes in "manageable" state
3. Move all nodes with successful introspection to "available" state.

Step 3 is pretty controversial, step 1 is just horrible. This is not how the ironic-inspector team designed introspection to work (hence it refuses to run on nodes in "available" state), and that's not how the ironic team expects the ironic state machine to be handled. To explain it I'll provide some brief information on the ironic state machine.

ironic node lifecycle
-

With recent versions of the bare metal API (starting with 1.11), nodes begin their life in a state called "enroll". Nodes in this state are not available for deployment, nor for most other actions. Ironic does not touch such nodes in any way.

To make nodes alive an operator uses the "manage" provisioning action to move nodes to "manageable" state. During this transition the power and management credentials (IPMI, SSH, etc.) are validated to ensure that nodes in "manageable" state are, well, manageable. This state is still not available for deployment.
With nodes in this state an operator can execute various pre-deployment actions, such as introspection, RAID configuration, etc. So to sum it up, nodes in "manageable" state are being configured before exposing them into the cloud.

The last step before the deployment is to make nodes "available" using the "provide" provisioning action. Such nodes are exposed to nova, and can be deployed to at any moment. No long-running configuration actions should be run in this state. The "manage" action can be used to bring nodes back to "manageable" state for configuration (e.g. reintrospection).

so what's the problem?
--

The problem is that TripleO essentially bypasses this logic by keeping all nodes "available" and walking them through provisioning steps automatically. Just a couple of examples of what gets broken:

(1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for deployment (including potential autoscaling) and I want to enroll 10 more nodes.

Both introspection and ready-state operations nowadays will touch both the 10 new nodes AND the 10 nodes which are ready for deployment, potentially making the latter not ready for deployment any more (and definitely moving them out of the pool for some time).

In particular, any manual configuration made by an operator before making nodes "available" may get destroyed.

(2) TripleO has to disable automated cleaning. Automated cleaning is a set of steps (currently only wiping the hard drive) that happens in ironic 1) before nodes are available, 2) after an instance is deleted. As the TripleO CLI constantly moves nodes back and forth from and to the "available" state, cleaning kicks in every time. Unless it's disabled.

Disabling cleaning might sound like a sufficient workaround, until you need it. And you actually do. Here is a real life example of how to get yourself broken by not having cleaning:

a. Deploy an overcloud instance
b. Delete it
c. Deploy an overcloud instance on a different hard drive
d. Boom.
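The failure in step (d) can be simulated abstractly: with cleaning skipped, the config drive from the first deployment survives on the old disk, and cloud-init may pick either of the two it finds. This is a toy model of the scenario, not cloud-init's actual device-probing logic:

```python
import random

# Toy model of the "two config drives" failure described above.
# Not cloud-init's real probing logic; disks are modeled as a dict
# mapping device name -> config drive contents (or None).

def deploy(disks, target, metadata, clean_first=False):
    if clean_first:
        for disk in disks:
            disks[disk] = None  # automated cleaning wipes every disk
    disks[target] = metadata    # new deployment writes its config drive

def config_drives(disks):
    return [d for d, meta in disks.items() if meta is not None]

disks = {"sda": None, "sdb": None}
deploy(disks, "sda", "overcloud-v1")   # a. deploy
# b. delete the instance: without cleaning, sda keeps its config drive
deploy(disks, "sdb", "overcloud-v2")   # c. deploy on a different disk
leftover = config_drives(disks)        # d. two config drives now present
picked = random.choice(leftover)       # cloud-init may pick either one
```

Running the same sequence with `clean_first=True` on step (c) leaves exactly one config drive, which is the guarantee cleaning provides.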
As we didn't pass cleaning, there is still a config drive on the disk used in the first deployment. With 2 config drives present cloud-init will pick a random one, breaking the deployment.

To top it all, TripleO users tend to not use root device hints, so switching root disks may happen randomly between deployments. Have fun debugging.

what do you propose?

I would like the new TripleO mistral workflows to start following the ironic state machine closer. Imagine the following workflows:

1. register: take JSON, create nodes in "manageable" state. I do believe we can automate the enroll->manageable transition, as it serves the purpose of validation (and discovery, but lets put it aside).

2. provide: take a list of nodes or all "manageable" nodes and move them to "available". By using this workflow an operator will make a *conscious* decision to add some nodes to the clou