Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-07-05 Thread Steven Hardy
On Tue, Jul 05, 2016 at 12:22:33PM +0200, Dmitry Tantsur wrote:
> On 07/04/2016 01:42 PM, Steven Hardy wrote:
> > Hi Dmitry,
> > 
> > I wanted to revisit this thread, as I see some of these interfaces
> > are now posted for review, and I have a couple of questions around
> > the naming (specifically for the "provide" action):
> > 
> > On Thu, May 19, 2016 at 03:31:36PM +0200, Dmitry Tantsur wrote:
> > 
> > > The last step before the deployment is to make nodes "available" using the
> > > "provide" provisioning action. Such nodes are exposed to nova, and can be
> > > deployed to at any moment. No long-running configuration actions should be
> > > run in this state. The "manage" action can be used to bring nodes back to
> > > "manageable" state for configuration (e.g. reintrospection).
> > 
> > So, I've been reviewing https://review.openstack.org/#/c/334411/ which
> > implements support for "openstack overcloud node provide"
> > 
> > I really hate to be the one nitpicking over openstackclient verbiage, but
> > I'm a little unsure if the literal translation of this results in an
> > intuitive understanding of what happens to the nodes as a result of this
> > action. So I wanted to have a broader discussion before we land the code
> > and commit to this interface.
> > 
> 
> > 
> > Here, I think the problem is that while the dictionary definition of
> > "provide" is "make available for use, supply" (according to google), it
> > implies obtaining the node, not just activating it.
> > 
> > So, to me "provide node" implies going and physically getting the node that
> > does not yet exist, but AFAICT what this action actually does is takes an
> > existing node, and activates it (sets it to "available" state)
> > 
> > I'm worried this could be a source of operator confusion - has this
> > discussion already happened in the Ironic community, or is this a TripleO
> > specific term?
> 
> Hi, and thanks for the great question.
> 
> As I've already responded on the patch, this term is settled in our OSC
> plugin spec [1], and we feel like it reflects the reality pretty well. But I
> clearly understand that naming things is really hard, and what feels obvious
> to me does not feel obvious to the others. Anyway, I'd prefer if we stay
> consistent with how Ironic names things now.
> 
> [1] 
> http://specs.openstack.org/openstack/ironic-specs/specs/approved/ironicclient-osc-plugin.html

Thanks, this is the context I was missing - If the term is already accepted
by the ironic community then I agree, let's keep things consistent.

Thanks!

Steve

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-07-05 Thread Dmitry Tantsur

On 07/04/2016 01:42 PM, Steven Hardy wrote:

Hi Dmitry,

I wanted to revisit this thread, as I see some of these interfaces
are now posted for review, and I have a couple of questions around
the naming (specifically for the "provide" action):

On Thu, May 19, 2016 at 03:31:36PM +0200, Dmitry Tantsur wrote:


The last step before the deployment is to make nodes "available" using the
"provide" provisioning action. Such nodes are exposed to nova, and can be
deployed to at any moment. No long-running configuration actions should be
run in this state. The "manage" action can be used to bring nodes back to
"manageable" state for configuration (e.g. reintrospection).


So, I've been reviewing https://review.openstack.org/#/c/334411/ which
implements support for "openstack overcloud node provide"

I really hate to be the one nitpicking over openstackclient verbiage, but
I'm a little unsure if the literal translation of this results in an
intuitive understanding of what happens to the nodes as a result of this
action. So I wanted to have a broader discussion before we land the code
and commit to this interface.





Here, I think the problem is that while the dictionary definition of
"provide" is "make available for use, supply" (according to google), it
implies obtaining the node, not just activating it.

So, to me "provide node" implies going and physically getting the node that
does not yet exist, but AFAICT what this action actually does is takes an
existing node, and activates it (sets it to "available" state)

I'm worried this could be a source of operator confusion - has this
discussion already happened in the Ironic community, or is this a TripleO
specific term?


Hi, and thanks for the great question.

As I've already responded on the patch, this term is settled in our OSC 
plugin spec [1], and we feel like it reflects the reality pretty well. 
But I clearly understand that naming things is really hard, and what 
feels obvious to me does not feel obvious to the others. Anyway, I'd 
prefer if we stay consistent with how Ironic names things now.


[1] 
http://specs.openstack.org/openstack/ironic-specs/specs/approved/ironicclient-osc-plugin.html




To me, something like "openstack overcloud node enable" or maybe "node
activate" would be more intuitive, as it implies taking an existing node
from the inventory and making it active/available in the context of the
overcloud deployment?


The problem here is that "provide" does not just "enable" nodes. It also 
makes nodes pass through cleaning, which may be a pretty complex and 
long process (we have it disabled for TripleO for this reason).
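The knob involved here is ironic's automated cleaning switch; a minimal sketch of how a TripleO-style deployment keeps it off (assuming ironic's documented `[conductor] automated_clean` option; the file path is the conventional one, not taken from this thread):

```ini
# /etc/ironic/ironic.conf -- sketch, not a full undercloud config
[conductor]
# When true, nodes pass through automated cleaning (disk wiping) on the
# way to "available" and after instance deletion. TripleO keeps this off
# because "provide" would otherwise trigger a long cleaning run.
automated_clean = false
```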




Anyway, not a huge issue, but given that this is a new step in our nodes
workflow, I wanted to ensure folks are comfortable with the terminology
before we commit to it in code.

Thanks!

Steve







Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-07-04 Thread Sanjay Upadhyay
On Mon, Jul 4, 2016 at 5:12 PM, Steven Hardy  wrote:

> Hi Dmitry,
>
> I wanted to revisit this thread, as I see some of these interfaces
> are now posted for review, and I have a couple of questions around
> the naming (specifically for the "provide" action):
>
> On Thu, May 19, 2016 at 03:31:36PM +0200, Dmitry Tantsur wrote:
> 
> > The last step before the deployment is to make nodes "available" using
> the
> > "provide" provisioning action. Such nodes are exposed to nova, and can be
> > deployed to at any moment. No long-running configuration actions should
> be
> > run in this state. The "manage" action can be used to bring nodes back to
> > "manageable" state for configuration (e.g. reintrospection).
>
> So, I've been reviewing https://review.openstack.org/#/c/334411/ which
> implements support for "openstack overcloud node provide"
>
> I really hate to be the one nitpicking over openstackclient verbiage, but
> I'm a little unsure if the literal translation of this results in an
> intuitive understanding of what happens to the nodes as a result of this
> action. So I wanted to have a broader discussion before we land the code
> and commit to this interface.
>
> More info below:
> 
> > what do you propose?
> > 
> >
> > I would like the new TripleO mistral workflows to start following the
> ironic
> > state machine closer. Imagine the following workflows:
> >
> > 1. register: take JSON, create nodes in "manageable" state. I do believe
> we
> > can automate the enroll->manageable transition, as it serves the purpose
> of
> > validation (and discovery, but lets put it aside).
> >
> > 2. provide: take a list of nodes or all "manageable" nodes and move them
> to
> > "available". By using this workflow an operator will make a *conscious*
> > decision to add some nodes to the cloud.
>
> Here, I think the problem is that while the dictionary definition of
> "provide" is "make available for use, supply" (according to google), it
> implies obtaining the node, not just activating it.
>
> So, to me "provide node" implies going and physically getting the node that
> does not yet exist, but AFAICT what this action actually does is takes an
> existing node, and activates it (sets it to "available" state)
>
> I'm worried this could be a source of operator confusion - has this
> discussion already happened in the Ironic community, or is this a TripleO
> specific term?
>
> To me, something like "openstack overcloud node enable" or maybe "node
> activate" would be more intuitive, as it implies taking an existing node
> from the inventory and making it active/available in the context of the
> overcloud deployment?
>


My 2 cents: as an operator, the part wherein a node is enrolled, manageable,
or available is a bit confusing to first-timers. It would be simpler to have
just "baremetal nodes" (nodes in the enrolled or manageable states) and
"cluster nodes" (nodes in the available or deployed states).

I do not know if there is a deployed state :)
regards
/sanjay


>
> Anyway, not a huge issue, but given that this is a new step in our nodes
> workflow, I wanted to ensure folks are comfortable with the terminology
> before we commit to it in code.
>
> Thanks!
>
> Steve
>


Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-07-04 Thread Steven Hardy
Hi Dmitry,

I wanted to revisit this thread, as I see some of these interfaces
are now posted for review, and I have a couple of questions around
the naming (specifically for the "provide" action):

On Thu, May 19, 2016 at 03:31:36PM +0200, Dmitry Tantsur wrote:

> The last step before the deployment is to make nodes "available" using the
> "provide" provisioning action. Such nodes are exposed to nova, and can be
> deployed to at any moment. No long-running configuration actions should be
> run in this state. The "manage" action can be used to bring nodes back to
> "manageable" state for configuration (e.g. reintrospection).

So, I've been reviewing https://review.openstack.org/#/c/334411/ which
implements support for "openstack overcloud node provide"

I really hate to be the one nitpicking over openstackclient verbiage, but
I'm a little unsure if the literal translation of this results in an
intuitive understanding of what happens to the nodes as a result of this
action. So I wanted to have a broader discussion before we land the code
and commit to this interface.

More info below:

> what do you propose?
> 
> 
> I would like the new TripleO mistral workflows to start following the ironic
> state machine closer. Imagine the following workflows:
> 
> 1. register: take JSON, create nodes in "manageable" state. I do believe we
> can automate the enroll->manageable transition, as it serves the purpose of
> validation (and discovery, but lets put it aside).
> 
> 2. provide: take a list of nodes or all "manageable" nodes and move them to
> "available". By using this workflow an operator will make a *conscious*
> decision to add some nodes to the cloud.

Here, I think the problem is that while the dictionary definition of
"provide" is "make available for use, supply" (according to google), it
implies obtaining the node, not just activating it.

So, to me "provide node" implies going and physically getting the node that
does not yet exist, but AFAICT what this action actually does is takes an
existing node, and activates it (sets it to "available" state)

I'm worried this could be a source of operator confusion - has this
discussion already happened in the Ironic community, or is this a TripleO
specific term?

To me, something like "openstack overcloud node enable" or maybe "node
activate" would be more intuitive, as it implies taking an existing node
from the inventory and making it active/available in the context of the
overcloud deployment?

Anyway, not a huge issue, but given that this is a new step in our nodes
workflow, I wanted to ensure folks are comfortable with the terminology
before we commit to it in code.

Thanks!

Steve



Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-05-23 Thread Dmitry Tantsur

On 05/21/2016 08:35 PM, Dan Prince wrote:

On Fri, 2016-05-20 at 14:06 +0200, Dmitry Tantsur wrote:

On 05/20/2016 01:44 PM, Dan Prince wrote:


On Thu, 2016-05-19 at 15:31 +0200, Dmitry Tantsur wrote:


Hi all!

We started some discussions on https://review.openstack.org/#/c/300200/
about the future of node management (registering, configuring and
introspecting) in the new API, but I think it's more fair (and
convenient) to move it here. The goal is to fix several long-standing
design flaws that affect the logic behind tripleoclient. So fasten your
seatbelts, here it goes.

If you already understand why we need to change this logic, just scroll
down to "what do you propose?" section.

"introspection bulk start" is evil
--

As many of you obviously know, TripleO used the following command for
introspection:

  openstack baremetal introspection bulk start

As not everyone knows though, this command does not come from
ironic-inspector project, it's part of TripleO itself. And the ironic
team has some big problems with it.

The way it works is

1. Take all nodes in "available" state and move them to "manageable" state
2. Execute introspection for all nodes in "manageable" state
3. Move all nodes with successful introspection to "available" state.

Step 3 is pretty controversial, step 1 is just horrible. This is not how
the ironic-inspector team designed introspection to work (hence it
refuses to run on nodes in "available" state), and that's not how the
ironic team expects the ironic state machine to be handled. To explain
it I'll provide brief information on the ironic state machine.

ironic node lifecycle
-

With recent versions of the bare metal API (starting with 1.11), nodes
begin their life in a state called "enroll". Nodes in this state are not
available for deployment, nor for most of other actions. Ironic does not
touch such nodes in any way.

To make nodes alive an operator uses "manage" provisioning action to
move nodes to "manageable" state. During this transition the power and
management credentials (IPMI, SSH, etc) are validated to ensure that
nodes in "manageable" state are, well, manageable. This state is still
not available for deployment. With nodes in this state an operator can
execute various pre-deployment actions, such as introspection, RAID
configuration, etc. So to sum it up, nodes in "manageable" state are
being configured before exposing them into the cloud.

The last step before the deployment is to make nodes "available" using
the "provide" provisioning action. Such nodes are exposed to nova, and
can be deployed to at any moment. No long-running configuration actions
should be run in this state. The "manage" action can be used to bring
nodes back to "manageable" state for configuration (e.g. reintrospection).

so what's the problem?
--

The problem is that TripleO essentially bypasses this logic by keeping
all nodes "available" and walking them through provisioning steps
automatically. Just a couple of examples of what gets broken:

(1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for
deployment (including potential autoscaling) and I want to enroll 10
more nodes.

Both introspection and ready-state operations nowadays will touch both
10 new nodes AND 10 nodes which are ready for deployment, potentially
making the latter not ready for deployment any more (and definitely
moving them out of pool for some time).

Particularly, any manual configuration made by an operator before making
nodes "available" may get destroyed.

(2) TripleO has to disable automated cleaning. Automated cleaning is a
set of steps (currently only wiping the hard drive) that happens in
ironic 1) before nodes are available, 2) after an instance is deleted.
As TripleO CLI constantly moves nodes back-and-forth from and to
"available" state, cleaning kicks in every time. Unless it's disabled.

Disabling cleaning might sound like a sufficient workaround, until you
need it. And you actually do. Here is a real-life example of how to get
yourself broken by not having cleaning:

a. Deploy an overcloud instance
b. Delete it
c. Deploy an overcloud instance on a different hard drive
d. Boom.

This sounds like an Ironic bug to me. Cleaning (wiping a disk) and
removing state that would break subsequent installations on a different
drive are different things. In TripleO I think the reason we disable
cleaning is largely because of the extra time it takes and the fact
that our baremetal cloud isn't multi-tenant (currently at least).

We fix this "bug" by introducing cleaning. This is the process to
guarantee each deployment starts with a clean environment. It's hard
to
known which remained data can cause which problem (e.g. what about a
remaining UEFI partition? any remainings of Ceph? I don't know).







As we didn't pass cleaning, there is still a config drive on the disk
used in the first deployment. With 2 config drives present cloud-init
wi

Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-05-21 Thread Dan Prince
On Fri, 2016-05-20 at 14:06 +0200, Dmitry Tantsur wrote:
> On 05/20/2016 01:44 PM, Dan Prince wrote:
> > 
> > On Thu, 2016-05-19 at 15:31 +0200, Dmitry Tantsur wrote:
> > > 
> > > Hi all!
> > > 
> > > We started some discussions on https://review.openstack.org/#/c/300200/
> > > about the future of node management (registering, configuring and
> > > introspecting) in the new API, but I think it's more fair (and
> > > convenient) to move it here. The goal is to fix several long-
> > > standing
> > > design flaws that affect the logic behind tripleoclient. So
> > > fasten
> > > your
> > > seatbelts, here it goes.
> > > 
> > > If you already understand why we need to change this logic, just
> > > scroll
> > > down to "what do you propose?" section.
> > > 
> > > "introspection bulk start" is evil
> > > --
> > > 
> > > As many of you obviously know, TripleO used the following command
> > > for
> > > introspection:
> > > 
> > >   openstack baremetal introspection bulk start
> > > 
> > > As not everyone knows though, this command does not come from
> > > ironic-inspector project, it's part of TripleO itself. And the
> > > ironic
> > > team has some big problems with it.
> > > 
> > > The way it works is
> > > 
> > > 1. Take all nodes in "available" state and move them to
> > > "manageable"
> > > state
> > > 2. Execute introspection for all nodes in "manageable" state
> > > 3. Move all nodes with successful introspection to "available"
> > > state.
> > > 
> > > Step 3 is pretty controversial, step 1 is just horrible. This is not
> > > how the ironic-inspector team designed introspection to work (hence
> > > it refuses to run on nodes in "available" state), and that's not how
> > > the ironic team expects the ironic state machine to be handled. To
> > > explain it I'll provide brief information on the ironic state machine.
> > > 
> > > ironic node lifecycle
> > > -
> > > 
> > > With recent versions of the bare metal API (starting with 1.11),
> > > nodes
> > > begin their life in a state called "enroll". Nodes in this state
> > > are
> > > not
> > > available for deployment, nor for most of other actions. Ironic
> > > does
> > > not
> > > touch such nodes in any way.
> > > 
> > > To make nodes alive an operator uses "manage" provisioning action
> > > to
> > > move nodes to "manageable" state. During this transition the
> > > power
> > > and
> > > management credentials (IPMI, SSH, etc) are validated to ensure
> > > that
> > > nodes in "manageable" state are, well, manageable. This state is
> > > still
> > > not available for deployment. With nodes in this state an
> > > operator
> > > can
> > > execute various pre-deployment actions, such as introspection,
> > > RAID
> > > configuration, etc. So to sum it up, nodes in "manageable" state
> > > are
> > > being configured before exposing them into the cloud.
> > > 
> > > The last step before the deployment is to make nodes "available"
> > > using
> > > the "provide" provisioning action. Such nodes are exposed to
> > > nova,
> > > and
> > > can be deployed to at any moment. No long-running configuration
> > > actions
> > > should be run in this state. The "manage" action can be used to
> > > bring
> > > nodes back to "manageable" state for configuration (e.g.
> > > reintrospection).
> > > 
> > > so what's the problem?
> > > --
> > > 
> > > The problem is that TripleO essentially bypasses this logic by
> > > keeping
> > > all nodes "available" and walking them through provisioning steps
> > > automatically. Just a couple of examples of what gets broken:
> > > 
> > > (1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for
> > > deployment (including potential autoscaling) and I want to enroll
> > > 10
> > > more nodes.
> > > 
> > > Both introspection and ready-state operations nowadays will touch
> > > both
> > > 10 new nodes AND 10 nodes which are ready for deployment,
> > > potentially
> > > making the latter not ready for deployment any more (and
> > > definitely
> > > moving them out of pool for some time).
> > > 
> > > Particularly, any manual configuration made by an operator before
> > > making
> > > nodes "available" may get destroyed.
> > > 
> > > (2) TripleO has to disable automated cleaning. Automated cleaning
> > > is
> > > a
> > > set of steps (currently only wiping the hard drive) that happens
> > > in
> > > ironic 1) before nodes are available, 2) after an instance is
> > > deleted.
> > > As TripleO CLI constantly moves nodes back-and-forth from and to
> > > "available" state, cleaning kicks in every time. Unless it's
> > > disabled.
> > > 
> > > Disabling cleaning might sound like a sufficient workaround, until
> > > you need it. And you actually do. Here is a real-life example of how
> > > to get yourself broken by not having cleaning:
> > > 
> > > a. Deploy an overcloud instance
> > > b. Delete it
> > > c. Deploy an overcloud i

Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-05-20 Thread Dmitry Tantsur

On 05/20/2016 03:42 PM, John Trowbridge wrote:



On 05/19/2016 09:31 AM, Dmitry Tantsur wrote:

Hi all!

We started some discussions on https://review.openstack.org/#/c/300200/
about the future of node management (registering, configuring and
introspecting) in the new API, but I think it's more fair (and
convenient) to move it here. The goal is to fix several long-standing
design flaws that affect the logic behind tripleoclient. So fasten your
seatbelts, here it goes.

If you already understand why we need to change this logic, just scroll
down to "what do you propose?" section.

"introspection bulk start" is evil
--

As many of you obviously know, TripleO used the following command for
introspection:

 openstack baremetal introspection bulk start

As not everyone knows though, this command does not come from
ironic-inspector project, it's part of TripleO itself. And the ironic
team has some big problems with it.

The way it works is

1. Take all nodes in "available" state and move them to "manageable" state
2. Execute introspection for all nodes in "manageable" state
3. Move all nodes with successful introspection to "available" state.

Step 3 is pretty controversial, step 1 is just horrible. This is not how
the ironic-inspector team designed introspection to work (hence it
refuses to run on nodes in "available" state), and that's not how the
ironic team expects the ironic state machine to be handled. To explain
it I'll provide brief information on the ironic state machine.
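The three steps above can be sketched as follows, using plain dicts as stand-ins for ironic nodes (illustrative only, not the real python-ironicclient / ironic-inspector API):

```python
# Sketch of what "introspection bulk start" effectively does, following
# the three steps described above. Node objects are trivial stand-ins.

def bulk_introspect(nodes, introspect):
    """nodes: list of {"uuid", "provision_state"}; introspect: uuid -> bool."""
    # Step 1: move ALL "available" nodes to "manageable" -- the horrible
    # part: nodes already ready for deployment get pulled out of the pool.
    for node in nodes:
        if node["provision_state"] == "available":
            node["provision_state"] = "manageable"
    # Step 2: run introspection on every "manageable" node.
    results = {n["uuid"]: introspect(n["uuid"])
               for n in nodes if n["provision_state"] == "manageable"}
    # Step 3: move successfully introspected nodes back to "available".
    for node in nodes:
        if results.get(node["uuid"]):
            node["provision_state"] = "available"
    return nodes

nodes = [{"uuid": "a", "provision_state": "available"},
         {"uuid": "b", "provision_state": "manageable"}]
bulk_introspect(nodes, introspect=lambda uuid: True)
assert all(n["provision_state"] == "available" for n in nodes)
```

Note how step 1 indiscriminately touches every "available" node, which is exactly the complaint in the text.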

ironic node lifecycle
-

With recent versions of the bare metal API (starting with 1.11), nodes
begin their life in a state called "enroll". Nodes in this state are not
available for deployment, nor for most of other actions. Ironic does not
touch such nodes in any way.

To make nodes alive an operator uses "manage" provisioning action to
move nodes to "manageable" state. During this transition the power and
management credentials (IPMI, SSH, etc) are validated to ensure that
nodes in "manageable" state are, well, manageable. This state is still
not available for deployment. With nodes in this state an operator can
execute various pre-deployment actions, such as introspection, RAID
configuration, etc. So to sum it up, nodes in "manageable" state are
being configured before exposing them into the cloud.

The last step before the deployment is to make nodes "available" using
the "provide" provisioning action. Such nodes are exposed to nova, and
can be deployed to at any moment. No long-running configuration actions
should be run in this state. The "manage" action can be used to bring
nodes back to "manageable" state for configuration (e.g. reintrospection).

so what's the problem?
--

The problem is that TripleO essentially bypasses this logic by keeping
all nodes "available" and walking them through provisioning steps
automatically. Just a couple of examples of what gets broken:

(1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for
deployment (including potential autoscaling) and I want to enroll 10
more nodes.

Both introspection and ready-state operations nowadays will touch both
10 new nodes AND 10 nodes which are ready for deployment, potentially
making the latter not ready for deployment any more (and definitely
moving them out of pool for some time).

Particularly, any manual configuration made by an operator before making
nodes "available" may get destroyed.

(2) TripleO has to disable automated cleaning. Automated cleaning is a
set of steps (currently only wiping the hard drive) that happens in
ironic 1) before nodes are available, 2) after an instance is deleted.
As TripleO CLI constantly moves nodes back-and-forth from and to
"available" state, cleaning kicks in every time. Unless it's disabled.

Disabling cleaning might sound like a sufficient workaround, until you need
it. And you actually do. Here is a real-life example of how to get
yourself broken by not having cleaning:

a. Deploy an overcloud instance
b. Delete it
c. Deploy an overcloud instance on a different hard drive
d. Boom.

As we didn't pass cleaning, there is still a config drive on the disk
used in the first deployment. With 2 config drives present cloud-init
will pick a random one, breaking the deployment.

To top it all, TripleO users tend to not use root device hints, so
switching root disks may happen randomly between deployments. Have fun
debugging.

what do you propose?


I would like the new TripleO mistral workflows to start following the
ironic state machine closer. Imagine the following workflows:

1. register: take JSON, create nodes in "manageable" state. I do believe
we can automate the enroll->manageable transition, as it serves the
purpose of validation (and discovery, but lets put it aside).

2. provide: take a list of nodes or all "manageable" nodes and move them
to "available". By using this workflow an operator will make a
*conscious* decision 

Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-05-20 Thread John Trowbridge


On 05/19/2016 09:31 AM, Dmitry Tantsur wrote:
> Hi all!
> 
> We started some discussions on https://review.openstack.org/#/c/300200/
> about the future of node management (registering, configuring and
> introspecting) in the new API, but I think it's more fair (and
> convenient) to move it here. The goal is to fix several long-standing
> design flaws that affect the logic behind tripleoclient. So fasten your
> seatbelts, here it goes.
> 
> If you already understand why we need to change this logic, just scroll
> down to "what do you propose?" section.
> 
> "introspection bulk start" is evil
> --
> 
> As many of you obviously know, TripleO used the following command for
> introspection:
> 
>  openstack baremetal introspection bulk start
> 
> As not everyone knows though, this command does not come from
> ironic-inspector project, it's part of TripleO itself. And the ironic
> team has some big problems with it.
> 
> The way it works is
> 
> 1. Take all nodes in "available" state and move them to "manageable" state
> 2. Execute introspection for all nodes in "manageable" state
> 3. Move all nodes with successful introspection to "available" state.
> 
> Step 3 is pretty controversial, step 1 is just horrible. This is not how
> the ironic-inspector team designed introspection to work (hence it
> refuses to run on nodes in "available" state), and that's not how the
> ironic team expects the ironic state machine to be handled. To explain
> it I'll provide brief information on the ironic state machine.
> 
> ironic node lifecycle
> -
> 
> With recent versions of the bare metal API (starting with 1.11), nodes
> begin their life in a state called "enroll". Nodes in this state are not
> available for deployment, nor for most of other actions. Ironic does not
> touch such nodes in any way.
> 
> To make nodes alive an operator uses "manage" provisioning action to
> move nodes to "manageable" state. During this transition the power and
> management credentials (IPMI, SSH, etc) are validated to ensure that
> nodes in "manageable" state are, well, manageable. This state is still
> not available for deployment. With nodes in this state an operator can
> execute various pre-deployment actions, such as introspection, RAID
> configuration, etc. So to sum it up, nodes in "manageable" state are
> being configured before exposing them into the cloud.
> 
> The last step before the deployment is to make nodes "available" using
> the "provide" provisioning action. Such nodes are exposed to nova, and
> can be deployed to at any moment. No long-running configuration actions
> should be run in this state. The "manage" action can be used to bring
> nodes back to "manageable" state for configuration (e.g. reintrospection).
> 
> so what's the problem?
> --
> 
> The problem is that TripleO essentially bypasses this logic by keeping
> all nodes "available" and walking them through provisioning steps
> automatically. Just a couple of examples of what gets broken:
> 
> (1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for
> deployment (including potential autoscaling) and I want to enroll 10
> more nodes.
> 
> Both introspection and ready-state operations nowadays will touch both
> 10 new nodes AND 10 nodes which are ready for deployment, potentially
> making the latter not ready for deployment any more (and definitely
> moving them out of pool for some time).
> 
> Particularly, any manual configuration made by an operator before making
> nodes "available" may get destroyed.
> 
> (2) TripleO has to disable automated cleaning. Automated cleaning is a
> set of steps (currently only wiping the hard drive) that happens in
> ironic 1) before nodes are available, 2) after an instance is deleted.
> As TripleO CLI constantly moves nodes back-and-forth from and to
> "available" state, cleaning kicks in every time. Unless it's disabled.
> 
> Disabling cleaning might sound like a sufficient workaround, until you need
> it. And you actually do. Here is a real-life example of how to get
> yourself broken by not having cleaning:
> 
> a. Deploy an overcloud instance
> b. Delete it
> c. Deploy an overcloud instance on a different hard drive
> d. Boom.
> 
> As we didn't pass cleaning, there is still a config drive on the disk
> used in the first deployment. With 2 config drives present cloud-init
> will pick a random one, breaking the deployment.
> 
> To top it all, TripleO users tend to not use root device hints, so
> switching root disks may happen randomly between deployments. Have fun
> debugging.
> 
> what do you propose?
> 
> 
> I would like the new TripleO mistral workflows to start following the
> ironic state machine more closely. Imagine the following workflows:
> 
> 1. register: take JSON, create nodes in "manageable" state. I do believe
> we can automate the enroll->manageable transition, as it serves the
> purpose of validation (and discovery, but let's put it aside).

Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-05-20 Thread Dmitry Tantsur

On 05/20/2016 02:54 PM, Steven Hardy wrote:

Hi Dmitry,

Thanks for the detailed write-up, some comments below:

On Thu, May 19, 2016 at 03:31:36PM +0200, Dmitry Tantsur wrote:


what do you propose?


I would like the new TripleO mistral workflows to start following the ironic
state machine more closely. Imagine the following workflows:

1. register: take JSON, create nodes in "manageable" state. I do believe we
can automate the enroll->manageable transition, as it serves the purpose of
validation (and discovery, but let's put it aside).

2. provide: take a list of nodes or all "manageable" nodes and move them to
"available". By using this workflow an operator will make a *conscious*
decision to add some nodes to the cloud.

3. introspect: take a list of "manageable" (!!!) nodes or all "manageable"
nodes and move them through introspection. This is an optional step between
"register" and "provide".

4. set_node_state: a helper workflow to move nodes between states. The
"provide" workflow is essentially set_node_state with verb=provide, but is
separate due to its high importance in the node lifecycle.

5. configure: given a couple of parameters (deploy image, local boot flag,
etc), update given or all "manageable" nodes with them.

Essentially the only addition here is the "provide" action which I hope you
already realize should be an explicit step.
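The five workflows above can be sketched as explicit transitions over a small state table. The verbs and state names mirror the proposal; the Node class, the transition table and the functions are illustrative, not the mistral or ironic APIs.

```python
# A sketch of the proposed workflows as explicit state transitions.
# Everything here is a stand-in for illustration only.

ALLOWED = {
    ("enroll", "manage"): "manageable",
    ("manageable", "provide"): "available",
    ("available", "manage"): "manageable",
}

class Node:
    def __init__(self, name, state="enroll"):
        self.name = name
        self.state = state

def _apply(node, verb):
    """Move a node through one provisioning verb, refusing bad transitions."""
    try:
        node.state = ALLOWED[(node.state, verb)]
    except KeyError:
        raise RuntimeError(f"{node.name}: cannot '{verb}' from '{node.state}'")

def register(node_json):
    """Workflow 1: enroll nodes, then auto-advance them to 'manageable'."""
    nodes = [Node(d["name"]) for d in node_json]
    for node in nodes:
        _apply(node, "manage")  # credentials would be validated here
    return nodes

def provide(nodes):
    """Workflow 2: a *conscious* operator step exposing nodes to nova."""
    for node in nodes:
        _apply(node, "provide")

nodes = register([{"name": "node-0"}, {"name": "node-1"}])
assert all(n.state == "manageable" for n in nodes)
provide(nodes)
assert all(n.state == "available" for n in nodes)
```

The point of the table is that "provide" only applies to "manageable" nodes, so making a node deployable is always an explicit, auditable step.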

what about tripleoclient


Of course we want to keep backward compatibility. The existing commands

 openstack baremetal import
 openstack baremetal configure boot
 openstack baremetal introspection bulk start

will use some combinations of workflows above and will be deprecated.

The new commands (also avoiding hijacking into the bare metal namespaces)
will be provided strictly matching the workflows (especially in terms of the
state machine):

 openstack overcloud node import
 openstack overcloud node configure
 openstack overcloud node introspect
 openstack overcloud node provide
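The compatibility story can be sketched as a simple mapping from the deprecated commands to their workflow-backed replacements; the command strings come from the text above, while the helper function is invented for illustration and is not part of tripleoclient.

```python
# Hypothetical deprecation mapping; not tripleoclient code.

DEPRECATED_TO_NEW = {
    "openstack baremetal import": "openstack overcloud node import",
    "openstack baremetal configure boot": "openstack overcloud node configure",
    "openstack baremetal introspection bulk start":
        "openstack overcloud node introspect",  # plus an explicit "provide"
}

def deprecation_warning(old_command):
    """Return the warning a deprecated command could print."""
    new = DEPRECATED_TO_NEW[old_command]
    return f"'{old_command}' is deprecated, use '{new}' instead"

msg = deprecation_warning("openstack baremetal import")
assert "openstack overcloud node import" in msg
```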


So, provided we maintain backwards compatibility this sounds OK, but one
question - is there any alternative approach that might solve this problem
more generally, e.g. not only for TripleO?


I was thinking about that.

We could move the import command to ironicclient, but then it won't support 
the TripleO format and additions. It's still a good thing to have; I'll 
talk about it upstream.


As to introspect and provide, the only thing which is different from the 
ironic analogs is that the ironic commands don't act on "all nodes in XXX 
state", and I don't think they ever will.
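The "act on all nodes in a given state" behaviour that the TripleO workflows add on top of the per-node ironic commands amounts to a filter on provision_state; a minimal sketch with invented node data:

```python
# Illustrative bulk-selection filter; node data is made up.

def nodes_in_state(nodes, state):
    """Select the nodes a bulk workflow would operate on."""
    return [n for n in nodes if n["provision_state"] == state]

inventory = [
    {"uuid": "a", "provision_state": "manageable"},
    {"uuid": "b", "provision_state": "available"},
    {"uuid": "c", "provision_state": "manageable"},
]

# e.g. "introspect all manageable nodes" resolves to nodes a and c only:
targets = nodes_in_state(inventory, "manageable")
assert [n["uuid"] for n in targets] == ["a", "c"]
```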




Given that we're likely to implement these workflows in mistral, it
probably does make sense to switch to a TripleO specific namespace, but I
can't help wondering if we're solving a general problem in a TripleO
specific way - e.g. isn't this something any user adding nodes from an
inventory, introspecting them and finally making them available for
deployment going to need?

Also, and it may be too late to fix this, "openstack overcloud node" is
kinda strange, because we're importing nodes on the undercloud, which could
in theory be used for any purpose, not only overcloud deployments.


I agree, but keeping our stuff in ironic's namespace leads to even more 
confusion and even potential conflicts (e.g. we can't introduce 
"baremetal import", because tripleo reserved it).




We've already done arguably the wrong thing with e.g. openstack overcloud image
upload (which, actually, uploads images to the undercloud), but I wanted to
point out that we're maintaining that inconsistency with your proposed
interface (which may be the least-bad option I suppose).

Thanks,

Steve

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev






Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-05-20 Thread Steven Hardy
Hi Dmitry,

Thanks for the detailed write-up, some comments below:

On Thu, May 19, 2016 at 03:31:36PM +0200, Dmitry Tantsur wrote:

> what do you propose?
> 
> 
> I would like the new TripleO mistral workflows to start following the ironic
> state machine more closely. Imagine the following workflows:
> 
> 1. register: take JSON, create nodes in "manageable" state. I do believe we
> can automate the enroll->manageable transition, as it serves the purpose of
> validation (and discovery, but let's put it aside).
> 
> 2. provide: take a list of nodes or all "manageable" nodes and move them to
> "available". By using this workflow an operator will make a *conscious*
> decision to add some nodes to the cloud.
> 
> 3. introspect: take a list of "manageable" (!!!) nodes or all "manageable"
> nodes and move them through introspection. This is an optional step between
> "register" and "provide".
> 
> 4. set_node_state: a helper workflow to move nodes between states. The
> "provide" workflow is essentially set_node_state with verb=provide, but is
> separate due to its high importance in the node lifecycle.
> 
> 5. configure: given a couple of parameters (deploy image, local boot flag,
> etc), update given or all "manageable" nodes with them.
> 
> Essentially the only addition here is the "provide" action which I hope you
> already realize should be an explicit step.
> 
> what about tripleoclient
> 
> 
> Of course we want to keep backward compatibility. The existing commands
> 
>  openstack baremetal import
>  openstack baremetal configure boot
>  openstack baremetal introspection bulk start
> 
> will use some combinations of workflows above and will be deprecated.
> 
> The new commands (also avoiding hijacking into the bare metal namespaces)
> will be provided strictly matching the workflows (especially in terms of the
> state machine):
> 
>  openstack overcloud node import
>  openstack overcloud node configure
>  openstack overcloud node introspect
>  openstack overcloud node provide

So, provided we maintain backwards compatibility this sounds OK, but one
question - is there any alternative approach that might solve this problem
more generally, e.g. not only for TripleO?

Given that we're likely to implement these workflows in mistral, it
probably does make sense to switch to a TripleO specific namespace, but I
can't help wondering if we're solving a general problem in a TripleO
specific way - e.g. isn't this something any user adding nodes from an
inventory, introspecting them and finally making them available for
deployment going to need?

Also, and it may be too late to fix this, "openstack overcloud node" is
kinda strange, because we're importing nodes on the undercloud, which could
in theory be used for any purpose, not only overcloud deployments.

We've already done arguably the wrong thing with e.g. openstack overcloud image
upload (which, actually, uploads images to the undercloud), but I wanted to
point out that we're maintaining that inconsistency with your proposed
interface (which may be the least-bad option I suppose).

Thanks,

Steve



Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-05-20 Thread Dmitry Tantsur

On 05/20/2016 01:44 PM, Dan Prince wrote:

On Thu, 2016-05-19 at 15:31 +0200, Dmitry Tantsur wrote:

Hi all!

We started some discussions on https://review.openstack.org/#/c/300200/
about the future of node management (registering, configuring and
introspecting) in the new API, but I think it's more fair (and
convenient) to move it here. The goal is to fix several long-standing
design flaws that affect the logic behind tripleoclient. So fasten your
seatbelts, here it goes.

If you already understand why we need to change this logic, just scroll
down to the "what do you propose?" section.

"introspection bulk start" is evil
--

As many of you obviously know, TripleO used the following command for
introspection:

  openstack baremetal introspection bulk start

As not everyone knows though, this command does not come from the
ironic-inspector project; it's part of TripleO itself. And the ironic
team has some big problems with it.

The way it works is

1. Take all nodes in "available" state and move them to "manageable"
state
2. Execute introspection for all nodes in "manageable" state
3. Move all nodes with successful introspection to "available" state.

Step 3 is pretty controversial, step 1 is just horrible. This is not how
the ironic-inspector team designed introspection to work (hence it
refuses to run on nodes in "available" state), and that's not how the
ironic team expects the ironic state machine to be handled. To explain
it I'll provide brief information on the ironic state machine.

ironic node lifecycle
-

With recent versions of the bare metal API (starting with 1.11), nodes
begin their life in a state called "enroll". Nodes in this state are not
available for deployment, nor for most other actions. Ironic does not
touch such nodes in any way.

To make nodes alive an operator uses "manage" provisioning action to
move nodes to "manageable" state. During this transition the power and
management credentials (IPMI, SSH, etc) are validated to ensure that
nodes in "manageable" state are, well, manageable. This state is still
not available for deployment. With nodes in this state an operator can
execute various pre-deployment actions, such as introspection, RAID
configuration, etc. So to sum it up, nodes in "manageable" state are
being configured before exposing them into the cloud.

The last step before the deployment is to make nodes "available" using
the "provide" provisioning action. Such nodes are exposed to nova, and
can be deployed to at any moment. No long-running configuration actions
should be run in this state. The "manage" action can be used to bring
nodes back to "manageable" state for configuration (e.g.
reintrospection).

so what's the problem?
--

The problem is that TripleO essentially bypasses this logic by keeping
all nodes "available" and walking them through provisioning steps
automatically. Just a couple of examples of what gets broken:

(1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for
deployment (including potential autoscaling) and I want to enroll 10
more nodes.

Both introspection and ready-state operations nowadays will touch both
10 new nodes AND 10 nodes which are ready for deployment, potentially
making the latter not ready for deployment any more (and definitely
moving them out of pool for some time).

Particularly, any manual configuration made by an operator before making
nodes "available" may get destroyed.

(2) TripleO has to disable automated cleaning. Automated cleaning is a
set of steps (currently only wiping the hard drive) that happens in
ironic 1) before nodes are available, 2) after an instance is deleted.
As TripleO CLI constantly moves nodes back-and-forth from and to
"available" state, cleaning kicks in every time. Unless it's disabled.

Disabling cleaning might sound like a sufficient workaround, until you
need it. And you actually do. Here is a real-life example of how to get
yourself broken by not having cleaning:

a. Deploy an overcloud instance
b. Delete it
c. Deploy an overcloud instance on a different hard drive
d. Boom.


This sounds like an Ironic bug to me. Cleaning (wiping a disk) and
removing state that would break subsequent installations on a different
drive are different things. In TripleO I think the reason we disable
cleaning is largely because of the extra time it takes and the fact
that our baremetal cloud isn't multi-tenant (currently at least).


We fix this "bug" by introducing cleaning. This is the process that 
guarantees each deployment starts with a clean environment. It's hard to 
know which leftover data can cause which problem (e.g. what about a 
remaining UEFI partition? Any remnants of Ceph? I don't know).






As we didn't pass cleaning, there is still a config drive on the disk
used in the first deployment. With 2 config drives present cloud-init
will pick a random one, breaking the deployment.


TripleO isn't using config drive is it? Until Nova supports config drives via Ironic I think we are blocked on using it.

Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-05-20 Thread Lucas Alvares Gomes
Hi,

> This sounds like an Ironic bug to me. Cleaning (wiping a disk) and
> removing state that would break subsequent installations on a different
> drive are different things. In TripleO I think the reason we disable
> cleaning is largely because of the extra time it takes and the fact
> that our baremetal cloud isn't multi-tenant (currently at least).
>

It's a complicated issue, there are ways in Ironic to make sure the
image will always be deployed onto a specific hard drive [0]. But when
it's not specified Ironic will pick the first disk that appears and in
Linux, at least for SATA, SCSI or IDE disk controllers, the order in
which the devices are added is arbitrary, e.g., /dev/sda and /dev/sdb
could swap around between reboots.
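The effect of root device hints can be illustrated with a toy disk picker: without a hint, "first disk" is whatever the kernel enumerated first, which can change between reboots. The hint format loosely follows ironic's root_device property (matching by serial here); the disk data is invented.

```python
# Illustrative disk selection; not ironic's implementation.

def pick_root_disk(disks, root_device_hint=None):
    """Pick the deployment target, preferring an explicit hint."""
    if root_device_hint:
        for disk in disks:
            if all(disk.get(k) == v for k, v in root_device_hint.items()):
                return disk
        raise LookupError("no disk matches the root device hint")
    return disks[0]  # arbitrary: enumeration order is not stable

boot_a = [{"name": "/dev/sda", "serial": "S1"},
          {"name": "/dev/sdb", "serial": "S2"}]
boot_b = list(reversed(boot_a))  # same disks, swapped enumeration order

# Without a hint, the "same" deployment may land on a different disk:
assert pick_root_disk(boot_a)["serial"] != pick_root_disk(boot_b)["serial"]
# With a hint, the choice is deterministic across reboots:
assert pick_root_disk(boot_a, {"serial": "S1"}) == \
       pick_root_disk(boot_b, {"serial": "S1"})
```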

>>
>> As we didn't pass cleaning, there is still a config drive on the disk
>> used in the first deployment. With 2 config drives present cloud-init
>> will pick a random one, breaking the deployment.
>
> TripleO isn't using config drive is it? Until Nova supports config
> drives via Ironic I think we are blocked on using it.
>

It has been supported for two or more cycles already [1]. The
difference is that with baremetal the config drive lives on the
disk as a partition, while for VMs it's presented as an external
device.

[0] 
http://docs.openstack.org/developer/ironic/deploy/install-guide.html#specifying-the-disk-for-deployment
[1] 
http://docs.openstack.org/developer/ironic/deploy/install-guide.html#enabling-the-configuration-drive-configdrive
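The failure mode can be sketched around the filesystem label cloud-init uses to locate a config drive ("config-2"): a stale partition left by a previous deployment on another disk is indistinguishable from the fresh one. The partition data here is invented.

```python
# Illustrative config-drive discovery; data is made up.

CONFIG_DRIVE_LABEL = "config-2"

def find_config_drives(partitions):
    """Return every partition that looks like a config drive."""
    return [p for p in partitions if p["label"] == CONFIG_DRIVE_LABEL]

partitions = [
    {"dev": "/dev/sda2", "label": CONFIG_DRIVE_LABEL},  # stale, 1st deploy
    {"dev": "/dev/sdb2", "label": CONFIG_DRIVE_LABEL},  # fresh, 2nd deploy
]

candidates = find_config_drives(partitions)
# Two equally plausible matches: which one gets used is effectively
# arbitrary, which is exactly why skipping cleaning breaks redeployment.
assert len(candidates) == 2
```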

Hope that helps,
Lucas



Re: [openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-05-20 Thread Dan Prince
On Thu, 2016-05-19 at 15:31 +0200, Dmitry Tantsur wrote:
> Hi all!
> 
> We started some discussions on https://review.openstack.org/#/c/300200/
> about the future of node management (registering, configuring and 
> introspecting) in the new API, but I think it's more fair (and 
> convenient) to move it here. The goal is to fix several long-standing
> design flaws that affect the logic behind tripleoclient. So fasten your
> seatbelts, here it goes.
> 
> If you already understand why we need to change this logic, just scroll
> down to the "what do you propose?" section.
> 
> "introspection bulk start" is evil
> --
> 
> As many of you obviously know, TripleO used the following command for
> introspection:
> 
>   openstack baremetal introspection bulk start
> 
> As not everyone knows though, this command does not come from the
> ironic-inspector project; it's part of TripleO itself. And the ironic
> team has some big problems with it.
> 
> The way it works is
> 
> 1. Take all nodes in "available" state and move them to "manageable"
> state
> 2. Execute introspection for all nodes in "manageable" state
> 3. Move all nodes with successful introspection to "available" state.
> 
> Step 3 is pretty controversial, step 1 is just horrible. This is not how
> the ironic-inspector team designed introspection to work (hence it
> refuses to run on nodes in "available" state), and that's not how the
> ironic team expects the ironic state machine to be handled. To explain
> it I'll provide brief information on the ironic state machine.
> 
> ironic node lifecycle
> -
> 
> With recent versions of the bare metal API (starting with 1.11), nodes
> begin their life in a state called "enroll". Nodes in this state are not
> available for deployment, nor for most other actions. Ironic does not
> touch such nodes in any way.
> 
> To make nodes alive an operator uses "manage" provisioning action to
> move nodes to "manageable" state. During this transition the power and
> management credentials (IPMI, SSH, etc) are validated to ensure that
> nodes in "manageable" state are, well, manageable. This state is still
> not available for deployment. With nodes in this state an operator can
> execute various pre-deployment actions, such as introspection, RAID
> configuration, etc. So to sum it up, nodes in "manageable" state are
> being configured before exposing them into the cloud.
> 
> The last step before the deployment is to make nodes "available" using
> the "provide" provisioning action. Such nodes are exposed to nova, and
> can be deployed to at any moment. No long-running configuration actions
> should be run in this state. The "manage" action can be used to bring
> nodes back to "manageable" state for configuration (e.g.
> reintrospection).
> 
> so what's the problem?
> --
> 
> The problem is that TripleO essentially bypasses this logic by keeping
> all nodes "available" and walking them through provisioning steps
> automatically. Just a couple of examples of what gets broken:
> 
> (1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for 
> deployment (including potential autoscaling) and I want to enroll 10 
> more nodes.
> 
> Both introspection and ready-state operations nowadays will touch both
> 10 new nodes AND 10 nodes which are ready for deployment, potentially
> making the latter not ready for deployment any more (and definitely
> moving them out of pool for some time).
> 
> Particularly, any manual configuration made by an operator before making
> nodes "available" may get destroyed.
> 
> (2) TripleO has to disable automated cleaning. Automated cleaning is a
> set of steps (currently only wiping the hard drive) that happens in
> ironic 1) before nodes are available, 2) after an instance is deleted.
> As TripleO CLI constantly moves nodes back-and-forth from and to
> "available" state, cleaning kicks in every time. Unless it's disabled.
> 
> Disabling cleaning might sound like a sufficient workaround, until you
> need it. And you actually do. Here is a real-life example of how to get
> yourself broken by not having cleaning:
> 
> a. Deploy an overcloud instance
> b. Delete it
> c. Deploy an overcloud instance on a different hard drive
> d. Boom.

This sounds like an Ironic bug to me. Cleaning (wiping a disk) and
removing state that would break subsequent installations on a different
drive are different things. In TripleO I think the reason we disable
cleaning is largely because of the extra time it takes and the fact
that our baremetal cloud isn't multi-tenant (currently at least).

> 
> As we didn't pass cleaning, there is still a config drive on the disk
> used in the first deployment. With 2 config drives present cloud-init
> will pick a random one, breaking the deployment.

TripleO isn't using config drive is it? Until Nova supports config
drives via Ironic I think we are blocked on using it.

[openstack-dev] [tripleo] Nodes management in our shiny new TripleO API

2016-05-19 Thread Dmitry Tantsur

Hi all!

We started some discussions on https://review.openstack.org/#/c/300200/ 
about the future of node management (registering, configuring and 
introspecting) in the new API, but I think it's more fair (and 
convenient) to move it here. The goal is to fix several long-standing 
design flaws that affect the logic behind tripleoclient. So fasten your 
seatbelts, here it goes.


If you already understand why we need to change this logic, just scroll 
down to "what do you propose?" section.


"introspection bulk start" is evil
--

As many of you obviously know, TripleO used the following command for 
introspection:


 openstack baremetal introspection bulk start

As not everyone knows though, this command does not come from the 
ironic-inspector project; it's part of TripleO itself. And the ironic 
team has some big problems with it.


The way it works is

1. Take all nodes in "available" state and move them to "manageable" state
2. Execute introspection for all nodes in "manageable" state
3. Move all nodes with successful introspection to "available" state.

Step 3 is pretty controversial, step 1 is just horrible. This is not how 
the ironic-inspector team designed introspection to work (hence it 
refuses to run on nodes in "available" state), and that's not how the 
ironic team expects the ironic state machine to be handled. To explain 
it I'll provide brief information on the ironic state machine.
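The three bulk-start steps above can be modelled in a few lines to show why step 1 is harmful: it sweeps up *every* "available" node, including ones the operator considered ready for deployment. The node dictionaries and the introspect() stub are stand-ins, not the real clients.

```python
# Minimal model of "introspection bulk start"; all data is invented.

def introspect(node):
    return True  # pretend introspection always succeeds

def bulk_start(nodes):
    touched = [n for n in nodes if n["state"] == "available"]  # step 1
    for n in touched:
        n["state"] = "manageable"
    for n in touched:                                          # step 2
        n["ok"] = introspect(n)
    for n in touched:                                          # step 3
        if n["ok"]:
            n["state"] = "available"
    return touched

nodes = [{"name": "new-0", "state": "available"},
         {"name": "ready-for-deploy", "state": "available"}]
touched = bulk_start(nodes)
# The node that was already ready for deployment got pulled out of the
# pool too, even though only "new-0" needed introspection:
assert len(touched) == 2
```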


ironic node lifecycle
-

With recent versions of the bare metal API (starting with 1.11), nodes 
begin their life in a state called "enroll". Nodes in this state are not 
available for deployment, nor for most other actions. Ironic does not 
touch such nodes in any way.


To make nodes alive an operator uses "manage" provisioning action to 
move nodes to "manageable" state. During this transition the power and 
management credentials (IPMI, SSH, etc) are validated to ensure that 
nodes in "manageable" state are, well, manageable. This state is still 
not available for deployment. With nodes in this state an operator can 
execute various pre-deployment actions, such as introspection, RAID 
configuration, etc. So to sum it up, nodes in "manageable" state are 
being configured before exposing them into the cloud.


The last step before the deployment is to make nodes "available" using 
the "provide" provisioning action. Such nodes are exposed to nova, and 
can be deployed to at any moment. No long-running configuration actions 
should be run in this state. The "manage" action can be used to bring 
nodes back to "manageable" state for configuration (e.g. reintrospection).
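The lifecycle above can be condensed into a runnable sketch: "manage" validates power/management credentials on the way to "manageable", and "provide" is the only transition that exposes a node to nova. The functions and node dictionaries are stand-ins for ironic's behaviour, not its API.

```python
# Illustrative lifecycle model; not ironic code.

class CredentialError(Exception):
    pass

def manage(node):
    """enroll/available -> manageable, validating credentials first."""
    if not node.get("ipmi_password"):
        raise CredentialError(f"{node['name']}: cannot reach the BMC")
    node["state"] = "manageable"

def provide(node):
    """manageable -> available: the node is now exposed to nova."""
    assert node["state"] == "manageable"
    node["state"] = "available"

good = {"name": "n0", "state": "enroll", "ipmi_password": "secret"}
bad = {"name": "n1", "state": "enroll", "ipmi_password": ""}

manage(good)
provide(good)
assert good["state"] == "available"

try:
    manage(bad)
except CredentialError:
    pass
assert bad["state"] == "enroll"  # unreachable nodes are left alone
```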


so what's the problem?
--

The problem is that TripleO essentially bypasses this logic by keeping 
all nodes "available" and walking them through provisioning steps 
automatically. Just a couple of examples of what gets broken:


(1) Imagine I have 10 nodes in my overcloud, 10 nodes ready for 
deployment (including potential autoscaling) and I want to enroll 10 
more nodes.


Both introspection and ready-state operations nowadays will touch both 
10 new nodes AND 10 nodes which are ready for deployment, potentially 
making the latter not ready for deployment any more (and definitely 
moving them out of pool for some time).


Particularly, any manual configuration made by an operator before making 
nodes "available" may get destroyed.


(2) TripleO has to disable automated cleaning. Automated cleaning is a 
set of steps (currently only wiping the hard drive) that happens in 
ironic 1) before nodes are available, 2) after an instance is deleted. 
As TripleO CLI constantly moves nodes back-and-forth from and to 
"available" state, cleaning kicks in every time. Unless it's disabled.


Disabling cleaning might sound like a sufficient workaround, until you need 
it. And you actually do. Here is a real-life example of how to get 
yourself broken by not having cleaning:


a. Deploy an overcloud instance
b. Delete it
c. Deploy an overcloud instance on a different hard drive
d. Boom.

As we didn't pass cleaning, there is still a config drive on the disk 
used in the first deployment. With 2 config drives present cloud-init 
will pick a random one, breaking the deployment.


To top it all, TripleO users tend to not use root device hints, so 
switching root disks may happen randomly between deployments. Have fun 
debugging.


what do you propose?


I would like the new TripleO mistral workflows to start following the 
ironic state machine more closely. Imagine the following workflows:


1. register: take JSON, create nodes in "manageable" state. I do believe 
we can automate the enroll->manageable transition, as it serves the 
purpose of validation (and discovery, but let's put it aside).


2. provide: take a list of nodes or all "manageable" nodes and move them 
to "available". By using this workflow an operator will make a 
*conscious* decision to add some nodes to the cloud.