Re: [openstack-dev] [ironic]Ironic operations on nodes in maintenance mode

2015-12-09 Thread Jim Rollenhagen
Sorry I dropped the ball on this thread. :(

On Tue, Nov 24, 2015 at 11:51:52AM -0800, Shraddha Pandhe wrote:
> On Tue, Nov 24, 2015 at 7:39 AM, Jim Rollenhagen 
> wrote:
> 
> > On Mon, Nov 23, 2015 at 03:35:58PM -0800, Shraddha Pandhe wrote:
> > > Hi,
> > >
> > > I would like to know how everyone is using maintenance mode and what is
> > > expected from admins about nodes in maintenance. The reason I am bringing
> > > up this topic is because most of the ironic operations, including manual
> > > cleaning, are not allowed for nodes in maintenance. That's a problem for
> > > us.
> > >
> > > The way we use it is as follows:
> > >
> > > We allow users to put nodes in maintenance mode (indirectly) if they find
> > > something wrong with the node. They also provide a maintenance reason
> > > along with it, which gets stored as "user_reason" under
> > > maintenance_reason. So basically we tag it as a user-specified reason.
> > >
> > > To debug what happened to the node, our operators use manual cleaning to
> > > re-image the node. By doing this, they can find out all the issues
> > > related to re-imaging (dhcp, ipmi, image transfer, etc.). This debugging
> > > process applies to all the nodes that were put in maintenance either by
> > > the user or by the system (due to power cycle failure or cleaning failure).
> >
> > Interesting; do you let the node go through cleaning between the user
> > nuking the instance and doing this manual cleaning stuff?
> >
> 
> Do you mean automated cleaning? If so, yes, we let that go through since
> that's allowed in maintenance mode.

It isn't allowed upstream; for a long time now, heartbeats have been
recorded with no action taken.
> 
> >
> > At Rackspace, we leverage the fact that maintenance mode will not allow
> > the node to proceed through the state machine. If a user reports a
> > hardware issue, we maintenance the node on their behalf, and when they
> > delete it, it boots the agent for cleaning and begins heartbeating.
> > Heartbeats are ignored in maintenance mode, which gives us time to
> > investigate the hardware, fix things, etc. When the issue is resolved,
> > we remove maintenance mode, it goes through cleaning, then back in the
> > pool.
> 
> 
> What is the provision state when maintenance mode is removed? Does it
> automatically go back into the available pool? How does a user report a
> hardware issue?

The node remains in cleaning, with the agent heartbeating, until
maintenance mode is removed. Then it goes back through cleaning to
available.
> 
> Due to our large scale, we can't always ensure that someone will take care
> of the node right away. So we have some automation to make sure that the
> user's quota is freed.
> 
> 1. If a user finds some problem with the node, the user calls our break-fix
> extension (with reason for break-fix) which deletes the instance for the
> user and frees the quota.
> 2. Internally nova deletes the instance and calls destroy on virt driver.
> This follows the normal delete flow with automated cleaning.
> 3. We have an automated tool called Reparo which constantly monitors the
> node list for nodes in maintenance mode.
> 4. If it finds any nodes in maintenance, it runs one round of manual
> cleaning on it to check if the issue was transient.
> 5. If cleaning fails, we need someone to take a look at it.
> 6. If cleaning succeeds, we put the node back in the available pool.
> 
> This is the only way we can scale to hundreds of thousands of nodes. If
> manual cleaning were not allowed in maintenance mode, our operators would
> hate us :)
> 
> If the provision state of the node is such that the node cannot be picked
> up by the scheduler, we can remove maintenance mode and run manual
> cleaning.

Hm, I'm trying to think of a way to make that work without cleaning being
allowed in maintenance mode... I haven't got much. We've always preferred
for us (or our automation) to take a look at the node *before* we do any
cleaning on it, as cleaning may mask some of the evidence. The manageable
state is intended to be the provision state you mentioned. You can move
from "clean failed" to manageable, so if you could make something fail
cleaning when the node is in maintenance mode, that might be the best
route here.
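That flow could be sketched as a small guard evaluated during cleaning. This is purely an illustrative model of the idea, not real Ironic conductor code; every name below is invented:

```python
# Illustrative model: if a node is in maintenance mode, fail its cleaning
# so it lands in "clean failed", from which an operator can later move it
# to "manageable". Not actual Ironic internals.

def on_cleaning_heartbeat(node):
    """Return the provision state the node should end up in on heartbeat."""
    if node["maintenance"]:
        # Abort cleaning; "clean failed" permits the "manage" verb later.
        return "clean failed"
    return "cleaning"

def can_manage(provision_state):
    # Rough approximation of the Ironic state machine: states from which
    # a node can be moved to "manageable".
    return provision_state in ("enroll", "available", "clean failed")
```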

// jim

__
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [ironic]Ironic operations on nodes in maintenance mode

2015-12-08 Thread Ruby Loo
On 23 November 2015 at 18:35, Shraddha Pandhe 
wrote:

> Hi,
>
> I would like to know how everyone is using maintenance mode and what is
> expected from admins about nodes in maintenance. The reason I am bringing
> up this topic is because most of the ironic operations, including manual
> cleaning, are not allowed for nodes in maintenance. That's a problem for us.
>
>
So what are the reasons for not allowing manual cleaning when a node is in
maintenance (and in the manageable state)?

--ruby


Re: [openstack-dev] [ironic]Ironic operations on nodes in maintenance mode

2015-11-24 Thread Shraddha Pandhe
On Tue, Nov 24, 2015 at 7:39 AM, Jim Rollenhagen 
wrote:

> On Mon, Nov 23, 2015 at 03:35:58PM -0800, Shraddha Pandhe wrote:
> > Hi,
> >
> > I would like to know how everyone is using maintenance mode and what is
> > expected from admins about nodes in maintenance. The reason I am bringing
> > up this topic is because most of the ironic operations, including manual
> > cleaning, are not allowed for nodes in maintenance. That's a problem for
> > us.
> >
> > The way we use it is as follows:
> >
> > We allow users to put nodes in maintenance mode (indirectly) if they find
> > something wrong with the node. They also provide a maintenance reason
> > along with it, which gets stored as "user_reason" under
> > maintenance_reason. So basically we tag it as a user-specified reason.
> >
> > To debug what happened to the node, our operators use manual cleaning to
> > re-image the node. By doing this, they can find out all the issues
> > related to re-imaging (dhcp, ipmi, image transfer, etc.). This debugging
> > process applies to all the nodes that were put in maintenance either by
> > the user or by the system (due to power cycle failure or cleaning failure).
>
> Interesting; do you let the node go through cleaning between the user
> nuking the instance and doing this manual cleaning stuff?
>

Do you mean automated cleaning? If so, yes, we let that go through since
that's allowed in maintenance mode.

>
> At Rackspace, we leverage the fact that maintenance mode will not allow
> the node to proceed through the state machine. If a user reports a
> hardware issue, we maintenance the node on their behalf, and when they
> delete it, it boots the agent for cleaning and begins heartbeating.
> Heartbeats are ignored in maintenance mode, which gives us time to
> investigate the hardware, fix things, etc. When the issue is resolved,
> we remove maintenance mode, it goes through cleaning, then back in the
> pool.


What is the provision state when maintenance mode is removed? Does it
automatically go back into the available pool? How does a user report a
hardware issue?

Due to our large scale, we can't always ensure that someone will take care
of the node right away. So we have some automation to make sure that the
user's quota is freed.

1. If a user finds some problem with the node, the user calls our break-fix
extension (with reason for break-fix) which deletes the instance for the
user and frees the quota.
2. Internally nova deletes the instance and calls destroy on virt driver.
This follows the normal delete flow with automated cleaning.
3. We have an automated tool called Reparo which constantly monitors the
node list for nodes in maintenance mode.
4. If it finds any nodes in maintenance, it runs one round of manual
cleaning on it to check if the issue was transient.
5. If cleaning fails, we need someone to take a look at it.
6. If cleaning succeeds, we put the node back in the available pool.

This is the only way we can scale to hundreds of thousands of nodes. If
manual cleaning were not allowed in maintenance mode, our operators would
hate us :)

If the provision state of the node is such that the node cannot be picked
up by the scheduler, we can remove maintenance mode and run manual
cleaning.
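The six-step workflow above boils down to one decision per node. As a sketch, the Reparo-style logic might look like this; the field names are invented for illustration, not Reparo's actual implementation:

```python
def reparo_next_step(node):
    """Decide what a Reparo-like monitor does with a node.

    Mirrors the workflow above: nodes in maintenance get one round of
    manual cleaning; success returns them to the pool, failure escalates
    to an operator. Field names are illustrative only.
    """
    if not node.get("maintenance"):
        return "skip"                      # healthy node, nothing to do
    if not node.get("clean_attempted"):
        return "run-manual-cleaning"       # step 4: try one cleaning round
    if node.get("clean_succeeded"):
        return "return-to-pool"            # step 6: issue was transient
    return "escalate-to-operator"          # step 5: needs a human
```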


> We used to enroll nodes in maintenance mode, back when the API put them
> in the available state immediately, to avoid them being scheduled until
> we knew they were good to go. The enroll state solved this for us.
>
> Last, we use maintenance mode on available nodes if we want to
> temporarily pull them from the pool for a manual process or some
> testing. This can also be solved by the manageable state.
>
> // jim
>


Re: [openstack-dev] [ironic]Ironic operations on nodes in maintenance mode

2015-11-24 Thread Arkady_Kanevsky
Other use cases for maintenance mode are:

* HW component replacement, e.g., a NIC or a disk

* FW upgrade/downgrade - we should be able to use the ironic FW management 
API/CLI for it.

* HW configuration change, like re-provisioning a server or changing the RAID 
configuration. Again, we should be able to use the ironic FW management API/CLI 
for it.

Thanks,
Arkady

-Original Message-
From: Jim Rollenhagen [mailto:j...@jimrollenhagen.com]
Sent: Tuesday, November 24, 2015 9:39 AM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [ironic]Ironic operations on nodes in maintenance 
mode

On Mon, Nov 23, 2015 at 03:35:58PM -0800, Shraddha Pandhe wrote:
> Hi,
>
> I would like to know how everyone is using maintenance mode and what
> is expected from admins about nodes in maintenance. The reason I am
> bringing up this topic is because most of the ironic operations,
> including manual cleaning, are not allowed for nodes in maintenance.
> That's a problem for us.
>
> The way we use it is as follows:
>
> We allow users to put nodes in maintenance mode (indirectly) if they
> find something wrong with the node. They also provide a maintenance
> reason along with it, which gets stored as "user_reason" under
> maintenance_reason. So basically we tag it as a user-specified reason.
>
> To debug what happened to the node, our operators use manual cleaning
> to re-image the node. By doing this, they can find out all the issues
> related to re-imaging (dhcp, ipmi, image transfer, etc.). This
> debugging process applies to all the nodes that were put in
> maintenance either by the user or by the system (due to power cycle
> failure or cleaning failure).

Interesting; do you let the node go through cleaning between the user nuking 
the instance and doing this manual cleaning stuff?

At Rackspace, we leverage the fact that maintenance mode will not allow the 
node to proceed through the state machine. If a user reports a hardware issue, 
we maintenance the node on their behalf, and when they delete it, it boots the 
agent for cleaning and begins heartbeating.
Heartbeats are ignored in maintenance mode, which gives us time to investigate 
the hardware, fix things, etc. When the issue is resolved, we remove 
maintenance mode, it goes through cleaning, then back in the pool.

We used to enroll nodes in maintenance mode, back when the API put them in the 
available state immediately, to avoid them being scheduled until we knew they 
were good to go. The enroll state solved this for us.

Last, we use maintenance mode on available nodes if we want to temporarily pull 
them from the pool for a manual process or some testing. This can also be 
solved by the manageable state.

// jim



Re: [openstack-dev] [ironic]Ironic operations on nodes in maintenance mode

2015-11-24 Thread Jim Rollenhagen
On Mon, Nov 23, 2015 at 03:35:58PM -0800, Shraddha Pandhe wrote:
> Hi,
> 
> I would like to know how everyone is using maintenance mode and what is
> expected from admins about nodes in maintenance. The reason I am bringing
> up this topic is because most of the ironic operations, including manual
> cleaning, are not allowed for nodes in maintenance. That's a problem for us.
> 
> The way we use it is as follows:
> 
> We allow users to put nodes in maintenance mode (indirectly) if they find
> something wrong with the node. They also provide a maintenance reason along
> with it, which gets stored as "user_reason" under maintenance_reason. So
> basically we tag it as a user-specified reason.
> 
> To debug what happened to the node, our operators use manual cleaning to
> re-image the node. By doing this, they can find out all the issues related
> to re-imaging (dhcp, ipmi, image transfer, etc.). This debugging process
> applies to all the nodes that were put in maintenance either by the user or
> by the system (due to power cycle failure or cleaning failure).

Interesting; do you let the node go through cleaning between the user
nuking the instance and doing this manual cleaning stuff?

At Rackspace, we leverage the fact that maintenance mode will not allow
the node to proceed through the state machine. If a user reports a
hardware issue, we maintenance the node on their behalf, and when they
delete it, it boots the agent for cleaning and begins heartbeating.
Heartbeats are ignored in maintenance mode, which gives us time to
investigate the hardware, fix things, etc. When the issue is resolved,
we remove maintenance mode, it goes through cleaning, then back in the
pool.

We used to enroll nodes in maintenance mode, back when the API put them
in the available state immediately, to avoid them being scheduled until
we knew they were good to go. The enroll state solved this for us.

Last, we use maintenance mode on available nodes if we want to
temporarily pull them from the pool for a manual process or some
testing. This can also be solved by the manageable state.
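The common thread in these use cases is keeping a node out of the scheduler's pool. As a rough sketch (a simplified view, not the actual nova/ironic scheduling code), a node is only a candidate when it is "available" and not in maintenance, which is why the enroll and manageable states can stand in for maintenance mode:

```python
def schedulable(provision_state, maintenance):
    # Simplified model of when the scheduler may pick a node: it must be
    # "available" and not in maintenance. States like "enroll" and
    # "manageable" keep a node out of the pool just as maintenance does.
    return provision_state == "available" and not maintenance
```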

// jim



[openstack-dev] [ironic]Ironic operations on nodes in maintenance mode

2015-11-23 Thread Shraddha Pandhe
Hi,

I would like to know how everyone is using maintenance mode and what is
expected from admins about nodes in maintenance. The reason I am bringing
up this topic is because most of the ironic operations, including manual
cleaning, are not allowed for nodes in maintenance. That's a problem for us.

The way we use it is as follows:

We allow users to put nodes in maintenance mode (indirectly) if they find
something wrong with the node. They also provide a maintenance reason along
with it, which gets stored as "user_reason" under maintenance_reason. So
basically we tag it as a user-specified reason.
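The tagging can be as simple as encoding the source into the stored string; a hypothetical helper (the key format here is a site convention invented for illustration, not an Ironic feature - maintenance_reason is just free-form text):

```python
import json

def tagged_reason(source, text):
    """Encode a maintenance reason with its source ("user" or "system").

    The JSON key format is a hypothetical site convention, not an Ironic
    feature; maintenance_reason itself is just a free-form string.
    """
    return json.dumps({"%s_reason" % source: text})

def reason_source(maintenance_reason):
    """Recover who set the reason from the encoded string."""
    (key,) = json.loads(maintenance_reason).keys()
    return key.replace("_reason", "")
```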

To debug what happened to the node, our operators use manual cleaning to
re-image the node. By doing this, they can find out all the issues related
to re-imaging (dhcp, ipmi, image transfer, etc.). This debugging process
applies to all the nodes that were put in maintenance either by the user or
by the system (due to power cycle failure or cleaning failure).

This is how we use maintenance mode in Ironic.