Re: [Openstack-operators] [scientific] Ironic Summit recap - ops experiences

2016-05-18 Thread Jim Rollenhagen
I forgot how to reply-all here :)

// jim

On Wed, May 18, 2016 at 05:35:55PM -0400, Jim Rollenhagen wrote:
> On Tue, May 17, 2016 at 11:32:25PM +0100, Stig Telfer wrote:
> > Is there anywhere that these experiences can be captured in a way that 
> > might help?
> > 
> > For example, I have a few DRAC-managed servers.  About half have fallen 
> > into a state where the pxe_drac driver can’t do anything with them 
> > (python-dracclient claims another transaction is underway).  But 
> > pxe_ipmitool works happily.
> > 
> > I’m pretty sure Ironic is not at fault here so it doesn’t seem fair to 
> > catalogue these things as Ironic bugs.  Perhaps the best action would be 
> > for Ironic to be more informative when it identifies a BMC is playing up.
> > 
> > Jay and Jim - any thoughts?
> 
> Yeah, unfortunately we can't fix the terribleness of all the BMCs in the
> world. We are working on a few different efforts to help operators deal
> with these in general (described in my summit wrapup): Nova-style
> notifications, BMC reset APIs, automatically returning nodes to service
> when a BMC is reachable again, etc.
> 
> I'd totally file a bug with python-dracclient for the specific DRAC
> thing you mentioned.
> 
> In general, feel free to file bugs. If it's something we can deal with,
> we will triage it; if not, we'll keep it in mind for the more general
> handling of these things.
> 
> Does that help?
> 
> // jim
> 
> > 
> > Best wishes,
> > Stig
> > 
> > 
> > > On 12 May 2016, at 11:37, Peter Love  wrote:
> > > 
> > > Nice talk on this stuff: https://www.youtube.com/watch?v=GZeUntdObCA
> > > 
> > > > On 12 May 2016 at 10:54, Matt Jarvis wrote:
> > > >> Very familiar list Tim, and we end up working around a lot of them with
> > > >> horrible hardware-specific code. Our bugbears also include:
> > > >> 
> > > >> - Required configuration only being available via a web interface, e.g.
> > > >> setting the hostname of the BMC on Supermicro hardware
> > > >> - IPMI hanging and requiring complete removal and reload of the kernel
> > > >> modules to enable resetting
> > > >> - Undocumented functions requiring raw IPMI commands: again on Supermicro
> > > >> there is some black magic to set dedicated ports, check power supply
> > > >> status, etc.
> > > >> - Web interfaces requiring Java and totally broken on mainstream browsers:
> > > >> HP iLOs in particular, which are almost impossible to use with a Mac
> > > >> - Firmware and BIOSes which don't allow command-line updating from inside
> > > >> a running OS
> > > >> 
> > > >> We're used to being able to flash BIOS images and CMOS settings by
> > > >> writing directly to the memory addresses, but more and more modern
> > > >> hardware won't let you do this anymore :(
> > > >> 
> > > >> We're hoping Redfish will solve some of the configuration-related issues,
> > > >> although obviously it won't make any difference to flaky BMC
> > > >> implementations and proprietary tooling to update firmware.
> > >> 
> > >> On 12 May 2016 at 06:25, Tim Bell  wrote:
> > >>> 
> > >>> 
> > >>> 
> > >>> On 12/05/16 06:22, "Stig Telfer"  wrote:
> > >>> 
> > >>>> Hi All -
> > >>>>
> > >>>> Jim Rollenhagen from the Ironic project has just posted a great summit
> > >>>> report of Ironic team activities on the openstack-devs mailing list[1],
> > >>>> which included this item which will be of interest to the Scientific WG
> > >>>> members who are looking to work on bare metal activities this cycle:
> > >>>>
> > >>>>> # Making ops less worse
> > >>>>>
> > >>>>> [Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-ops)
> > >>>>>
> > >>>>> We discussed some common failure cases that operators see, and how we
> > >>>>> can solve them in code.
> > >>>>>
> > >>>>> We discussed flaky BMCs, which end with the node in maintenance mode,
> > >>>>> and if Ironic can get them out of that mode automagically. We identified
> > >>>>> the need to distinguish between maintenance set by ironic and set by
> > >>>>> operators, and do things like attempt to connect to the BMC on a power
> > >>>>> state request, and turn off maintenance mode if successful. JayF is
> > >>>>> going to write a spec for this differentiation.
> > >>>>>
> > >>>>> Folks also expressed the desire to be able to reset the BMC via APIs. We
> > >>>>> have a BMC reset function in the vendor interface for the ipmitool
> > >>>>> driver; dtantsur volunteered to write a spec to promote that method to
> > >>>>> an official ManagementInterface method.
> > >>>>>
> > >>>>> We also talked for a while about stuck states. This has been mostly
> > >>>>> solved in code, but is still a problem for some deployers. We decided
> > >>>>> that we should not have a "reset-state" API like nova does, but rather a
> > >>>>> command line tool to handle this. lintan has volunteered to write a
> > >>>>> proposal for 

Re: [Openstack-operators] [scientific] Ironic Summit recap - ops experiences

2016-05-12 Thread Peter Love
Nice talk on this stuff: https://www.youtube.com/watch?v=GZeUntdObCA

On 12 May 2016 at 10:54, Matt Jarvis  wrote:
> Very familiar list Tim, and we end up working around a lot of them with
> horrible hardware-specific code. Our bugbears also include:
>
> - Required configuration only being available via a web interface, e.g.
> setting the hostname of the BMC on Supermicro hardware
> - IPMI hanging and requiring complete removal and reload of the kernel
> modules to enable resetting
> - Undocumented functions requiring raw IPMI commands: again on Supermicro
> there is some black magic to set dedicated ports, check power supply
> status, etc.
> - Web interfaces requiring Java and totally broken on mainstream browsers:
> HP iLOs in particular, which are almost impossible to use with a Mac
> - Firmware and BIOSes which don't allow command-line updating from inside
> a running OS
>
> We're used to being able to flash BIOS images and CMOS settings by writing
> directly to the memory addresses, but more and more modern hardware won't
> let you do this anymore :(
>
> We're hoping Redfish will solve some of the configuration related issues,
> although obviously it won't make any difference to flaky BMC implementations
> and proprietary tooling to update firmware.
>
> On 12 May 2016 at 06:25, Tim Bell  wrote:
>>
>>
>>
>> On 12/05/16 06:22, "Stig Telfer"  wrote:
>>
>> >Hi All -
>> >
>> >Jim Rollenhagen from the Ironic project has just posted a great summit
>> > report of Ironic team activities on the openstack-devs mailing list[1],
>> > which included this item which will be of interest to the Scientific WG
>> > members who are looking to work on bare metal activities this cycle:
>> >
>> >> # Making ops less worse
>> >>
>> >> [Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-ops)
>> >>
>> >> We discussed some common failure cases that operators see, and how we
>> >> can solve them in code.
>> >>
>> >> We discussed flaky BMCs, which end with the node in maintenance mode,
>> >> and if Ironic can get them out of that mode automagically. We
>> >> identified
>> >> the need to distinguish between maintenance set by ironic and set by
>> >> operators, and do things like attempt to connect to the BMC on a power
>> >> state request, and turn off maintenance mode if successful. JayF is
>> >> going to write a spec for this differentiation.
>> >>
>> >> Folks also expressed the desire to be able to reset the BMC via APIs.
>> >> We
>> >> have a BMC reset function in the vendor interface for the ipmitool
>> >> driver; dtantsur volunteered to write a spec to promote that method to
>> >> an official ManagementInterface method.
>> >>
>> >> We also talked for a while about stuck states. This has been mostly
>> >> solved in code, but is still a problem for some deployers. We decided
>> >> that we should not have a "reset-state" API like nova does, but rather
>> >> a
>> >> command line tool to handle this. lintan has volunteered to write a
>> >> proposal for this; I have also posted some [straw man
>> >> code](https://review.openstack.org/#/c/311273/) that someone is welcome
>> >> to take over or use.
>> >
>> >The operator issues already identified cover some things we’ve hit at
>> > Cambridge, please do scan through and contribute if there is anything they
>> > have not covered.
>> >
>>
>> We have certainly had our share of BMC problems through the years. It is
>> often frustrating as the very time you find you need the console, it is not
>> working. Having Ironic doing an active monitoring (without overloading)
>> would be a real help.
>>
>> The other item we’ve found difficult has been in the configuration:
>>
>> - Software maintenance is very limited. Some vendors choose to produce new
>> versions of the BMC microcode without changing the version number reported
>> by the BMC which makes consistent management difficult. There is no common
>> API defined for updating the code.
>> - Implementations between IPMI 1.5 and IPMI 2.0 vary significantly and
>> between commodity white boxes and blades
>> - BMCs have different Lan channels according to manufacturer for remote
>> access
>> - The tty speeds vary which means that the booted OS needs to have
>> different cmdlines for the kernel according to the underlying hardware
>> - the number of additional accounts is limited in some BMCs and password
>> management is very basic. Currently, we define distinct users for read-only
>> access to the SDRs (e.g. monitoring), console and power operations since
>> these need to be kept in different systems. We also have unique passwords
>> for each machine, all of which requires tracking. Foreman helps here but it
>> is not ideal.
>> - BMC replacement is also frequent. A process to re-import a replacement
>> BMC (new MAC, no user accounts defined) without re-installing the box is
>> needed.
>> - we have a fairly complex reset process which hits the BMC with different
>> 

Re: [Openstack-operators] [scientific] Ironic Summit recap - ops experiences

2016-05-12 Thread Matt Jarvis
Very familiar list Tim, and we end up working around a lot of them with
horrible hardware-specific code. Our bugbears also include:

- Required configuration only being available via a web interface, e.g.
setting the hostname of the BMC on Supermicro hardware
- IPMI hanging and requiring complete removal and reload of the kernel
modules to enable resetting
- Undocumented functions requiring raw IPMI commands: again on Supermicro
there is some black magic to set dedicated ports, check power supply
status, etc.
- Web interfaces requiring Java and totally broken on mainstream browsers:
HP iLOs in particular, which are almost impossible to use with a Mac
- Firmware and BIOSes which don't allow command-line updating from inside
a running OS
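
For the hung-IPMI item above, a minimal sketch of the "remove and reload
the kernel modules" workaround. The module names are the usual Linux IPMI
drivers, but which ones are loaded on a given box is an assumption, and
the helper names are made up for illustration. It defaults to a dry run
that only returns the commands.

```python
import subprocess

# Common Linux IPMI driver modules (assumed; check lsmod on your system).
# Ordered so the modules that depend on ipmi_msghandler are removed first.
IPMI_MODULES = ("ipmi_devintf", "ipmi_si", "ipmi_msghandler")


def ipmi_reload_commands(modules=IPMI_MODULES):
    """Build the modprobe commands: unload all, then load in reverse order."""
    unload = [["modprobe", "-r", m] for m in modules]
    load = [["modprobe", m] for m in reversed(modules)]
    return unload + load


def reload_ipmi(dry_run=True):
    """Return the command list; only execute it when dry_run is False."""
    cmds = ipmi_reload_commands()
    if not dry_run:
        for argv in cmds:
            # Needs root; raises CalledProcessError if a step fails.
            subprocess.run(argv, check=True)
    return cmds
```

Calling reload_ipmi() without arguments just shows what would be run,
which is probably what you want before trying it on a production node.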

We're used to being able to flash BIOS images and CMOS settings by writing
directly to the memory addresses, but more and more modern hardware won't
let you do this anymore :(

We're hoping Redfish will solve some of the configuration related issues,
although obviously it won't make any difference to flaky BMC
implementations and proprietary tooling to update firmware.

On 12 May 2016 at 06:25, Tim Bell  wrote:

>
>
> On 12/05/16 06:22, "Stig Telfer"  wrote:
>
> >Hi All -
> >
> >Jim Rollenhagen from the Ironic project has just posted a great summit
> report of Ironic team activities on the openstack-devs mailing list[1],
> which included this item which will be of interest to the Scientific WG
> members who are looking to work on bare metal activities this cycle:
> >
> >> # Making ops less worse
> >>
> >> [Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-ops)
> >>
> >> We discussed some common failure cases that operators see, and how we
> >> can solve them in code.
> >>
> >> We discussed flaky BMCs, which end with the node in maintenance mode,
> >> and if Ironic can get them out of that mode automagically. We identified
> >> the need to distinguish between maintenance set by ironic and set by
> >> operators, and do things like attempt to connect to the BMC on a power
> >> state request, and turn off maintenance mode if successful. JayF is
> >> going to write a spec for this differentiation.
> >>
> >> Folks also expressed the desire to be able to reset the BMC via APIs. We
> >> have a BMC reset function in the vendor interface for the ipmitool
> >> driver; dtantsur volunteered to write a spec to promote that method to
> >> an official ManagementInterface method.
> >>
> >> We also talked for a while about stuck states. This has been mostly
> >> solved in code, but is still a problem for some deployers. We decided
> >> that we should not have a "reset-state" API like nova does, but rather a
> >> command line tool to handle this. lintan has volunteered to write a
> >> proposal for this; I have also posted some [straw man
> >> code](https://review.openstack.org/#/c/311273/) that someone is welcome
> >> to take over or use.
> >
> >The operator issues already identified cover some things we’ve hit at
> Cambridge, please do scan through and contribute if there is anything they
> have not covered.
> >
>
> We have certainly had our share of BMC problems through the years. It is
> often frustrating as the very time you find you need the console, it is not
> working. Having Ironic doing an active monitoring (without overloading)
> would be a real help.
>
> The other item we’ve found difficult has been in the configuration:
>
> - Software maintenance is very limited. Some vendors choose to produce new
> versions of the BMC microcode without changing the version number reported
> by the BMC which makes consistent management difficult. There is no common
> API defined for updating the code.
> - Implementations between IPMI 1.5 and IPMI 2.0 vary significantly and
> between commodity white boxes and blades
> - BMCs have different Lan channels according to manufacturer for remote
> access
> - The tty speeds vary which means that the booted OS needs to have
> different cmdlines for the kernel according to the underlying hardware
> - the number of additional accounts is limited in some BMCs and password
> management is very basic. Currently, we define distinct users for read-only
> access to the SDRs (e.g. monitoring), console and power operations since
> these need to be kept in different systems. We also have unique passwords
> for each machine, all of which requires tracking. Foreman helps here but it
> is not ideal.
> - BMC replacement is also frequent. A process to re-import a replacement
> BMC (new MAC, no user accounts defined) without re-installing the box is
> needed.
> - we have a fairly complex reset process which hits the BMC with different
> levels of reset. We’ve also sometimes found the need to reset the IPMI
> kernel modules at the same time which go into a loop.
>
> I’m not expecting Ironic to fix all of this but it would be great to have
> a block of code which we can gradually improve together. There are other
> good 

Re: [Openstack-operators] [scientific] Ironic Summit recap - ops experiences

2016-05-11 Thread Tim Bell


On 12/05/16 06:22, "Stig Telfer"  wrote:

>Hi All - 
>
>Jim Rollenhagen from the Ironic project has just posted a great summit report 
>of Ironic team activities on the openstack-devs mailing list[1], which 
>included this item which will be of interest to the Scientific WG members who 
>are looking to work on bare metal activities this cycle:
>
>> # Making ops less worse
>> 
>> [Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-ops)
>> 
>> We discussed some common failure cases that operators see, and how we
>> can solve them in code.
>> 
>> We discussed flaky BMCs, which end with the node in maintenance mode,
>> and if Ironic can get them out of that mode automagically. We identified
>> the need to distinguish between maintenance set by ironic and set by
>> operators, and do things like attempt to connect to the BMC on a power
>> state request, and turn off maintenance mode if successful. JayF is
>> going to write a spec for this differentiation.
>> 
>> Folks also expressed the desire to be able to reset the BMC via APIs. We
>> have a BMC reset function in the vendor interface for the ipmitool
>> driver; dtantsur volunteered to write a spec to promote that method to
>> an official ManagementInterface method.
>> 
>> We also talked for a while about stuck states. This has been mostly
>> solved in code, but is still a problem for some deployers. We decided
>> that we should not have a "reset-state" API like nova does, but rather a
>> command line tool to handle this. lintan has volunteered to write a
>> proposal for this; I have also posted some [straw man
>> code](https://review.openstack.org/#/c/311273/) that someone is welcome
>> to take over or use.
>
>The operator issues already identified cover some things we’ve hit at 
>Cambridge, please do scan through and contribute if there is anything they 
>have not covered.
>

We have certainly had our share of BMC problems through the years. It is often 
frustrating that at the very time you find you need the console, it is not 
working. Having Ironic do active monitoring (without overloading the BMC) would 
be a real help.

The other item we’ve found difficult has been in the configuration:

- Software maintenance is very limited. Some vendors choose to produce new 
versions of the BMC microcode without changing the version number reported by 
the BMC which makes consistent management difficult. There is no common API 
defined for updating the code.
- Implementations vary significantly between IPMI 1.5 and IPMI 2.0, and between 
commodity white boxes and blades
- BMCs have different LAN channels for remote access according to manufacturer
- The tty speeds vary which means that the booted OS needs to have different 
cmdlines for the kernel according to the underlying hardware
- the number of additional accounts is limited in some BMCs and password 
management is very basic. Currently, we define distinct users for read-only 
access to the SDRs (e.g. monitoring), console and power operations since these 
need to be kept in different systems. We also have unique passwords for each 
machine, all of which requires tracking. Foreman helps here but it is not ideal.
- BMC replacement is also frequent. A process to re-import a replacement BMC 
(new MAC, no user accounts defined) without re-installing the box is needed.
- we have a fairly complex reset process which hits the BMC with different 
levels of reset. We’ve also sometimes found the need to reset the IPMI kernel 
modules at the same time which go into a loop.
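
To illustrate the tty speed item in the list above, here is a rough sketch
of picking kernel console arguments per hardware type. The model names,
ports and baud rates are invented placeholders; only the console= syntax
itself is standard Linux.

```python
# Hypothetical per-model serial console settings. Real values depend on
# how the BMC and BIOS of each hardware type are configured.
CONSOLE_BY_MODEL = {
    "vendor-a-blade": ("ttyS0", 115200),
    "vendor-b-whitebox": ("ttyS1", 57600),
}
DEFAULT_CONSOLE = ("ttyS0", 9600)


def kernel_console_args(model):
    """Return the kernel console= arguments for a hardware model."""
    port, baud = CONSOLE_BY_MODEL.get(model, DEFAULT_CONSOLE)
    # tty0 keeps output on the local display; the serial console is listed
    # last so that the kernel uses it as /dev/console.
    return "console=tty0 console=%s,%dn8" % (port, baud)
```

For example, kernel_console_args("vendor-b-whitebox") gives
"console=tty0 console=ttyS1,57600n8", and an unknown model falls back to
the default entry.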

I’m not expecting Ironic to fix all of this but it would be great to have a 
block of code which we can gradually improve together. There are other good 
initiatives like OpenBMC but they won’t help with the existing boxes.

I think my best advice to Ironic for BMC management would be to consider the 
BMC as a potentially unreliable device. That means, along with performing the 
actions, checking that they completed, and probing that a function which was 
working an hour ago is still working now (but without overloading the BMC). 
We’ll be looking at Ironic this year so we’ll be able to help on the failure 
cases.
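
A minimal sketch of what "treat the BMC as unreliable" could look like,
assuming a BMC object with set_power() and get_power() methods. Those
method names, the FlakyBMC stand-in and the one hour default are all
assumptions for illustration, not Ironic's actual driver interface.

```python
import time


class UnreliableBMC(Exception):
    """Raised when an action could not be confirmed after all retries."""


class FlakyBMC:
    """Hypothetical stand-in that silently drops the first few requests."""

    def __init__(self, drop=1):
        self.drop = drop
        self.state = "off"

    def set_power(self, target):
        if self.drop > 0:
            self.drop -= 1  # request is lost, state does not change
            return
        self.state = target

    def get_power(self):
        return self.state


def set_power_verified(bmc, target, retries=3, delay=0.0):
    """Request a power state, then read it back to confirm it took effect."""
    for _ in range(retries):
        try:
            bmc.set_power(target)
            if bmc.get_power() == target:
                return True
        except IOError:
            pass  # a failed call is treated like an unconfirmed action
        time.sleep(delay)
    raise UnreliableBMC("power state %r not confirmed" % target)


class RateLimitedProbe:
    """Re-check a BMC function, but never more often than min_interval."""

    def __init__(self, check, min_interval=3600.0, clock=time.monotonic):
        self.check = check
        self.min_interval = min_interval
        self.clock = clock
        self.last_run = None
        self.last_result = None

    def poll(self):
        """Run the check if it is due, otherwise return the cached result."""
        now = self.clock()
        if self.last_run is None or now - self.last_run >= self.min_interval:
            self.last_result = self.check()
            self.last_run = now
        return self.last_result
```

The probe caches its last result so that frequent callers do not turn
into frequent BMC requests, which is the "not overloading it" part.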

Tim

>Best wishes,
>Stig
>
>[1] http://lists.openstack.org/pipermail/openstack-dev/2016-May/094658.html 
>___
>OpenStack-operators mailing list
>OpenStack-operators@lists.openstack.org
>http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators



[Openstack-operators] [scientific] Ironic Summit recap - ops experiences

2016-05-11 Thread Stig Telfer
Hi All - 

Jim Rollenhagen from the Ironic project has just posted a great summit report 
of Ironic team activities on the openstack-dev mailing list[1], including 
this item, which will be of interest to the Scientific WG members who are 
looking to work on bare metal activities this cycle:

> # Making ops less worse
> 
> [Etherpad](https://etherpad.openstack.org/p/ironic-newton-summit-ops)
> 
> We discussed some common failure cases that operators see, and how we
> can solve them in code.
> 
> We discussed flaky BMCs, which end with the node in maintenance mode,
> and if Ironic can get them out of that mode automagically. We identified
> the need to distinguish between maintenance set by ironic and set by
> operators, and do things like attempt to connect to the BMC on a power
> state request, and turn off maintenance mode if successful. JayF is
> going to write a spec for this differentiation.
> 
> Folks also expressed the desire to be able to reset the BMC via APIs. We
> have a BMC reset function in the vendor interface for the ipmitool
> driver; dtantsur volunteered to write a spec to promote that method to
> an official ManagementInterface method.
> 
> We also talked for a while about stuck states. This has been mostly
> solved in code, but is still a problem for some deployers. We decided
> that we should not have a "reset-state" API like nova does, but rather a
> command line tool to handle this. lintan has volunteered to write a
> proposal for this; I have also posted some [straw man
> code](https://review.openstack.org/#/c/311273/) that someone is welcome
> to take over or use.
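
As a rough illustration of the maintenance-mode idea in the quoted recap,
here is a sketch that records who set maintenance and only clears it
automatically when the service itself set it. The Node fields and function
names are hypothetical and do not reflect Ironic's real data model or the
spec that JayF is going to write.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical markers for who put the node into maintenance.
SET_BY_IRONIC = "ironic"
SET_BY_OPERATOR = "operator"


@dataclass
class Node:
    name: str
    maintenance: bool = False
    maintenance_set_by: Optional[str] = None


def mark_bmc_failure(node: Node) -> None:
    """Put a node into maintenance after a BMC failure, unless already set."""
    if not node.maintenance:
        node.maintenance = True
        node.maintenance_set_by = SET_BY_IRONIC


def on_power_state_request(node: Node,
                           bmc_reachable: Callable[[Node], bool]) -> bool:
    """Clear maintenance only if the service set it and the BMC answers.

    Returns the node's maintenance flag after the check.
    """
    if not node.maintenance or node.maintenance_set_by != SET_BY_IRONIC:
        return node.maintenance  # operator-set maintenance is left alone
    if bmc_reachable(node):
        node.maintenance = False
        node.maintenance_set_by = None
    return node.maintenance
```

The point of the extra field is that an operator who deliberately parked
a node never has it returned to service behind their back.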

The operator issues already identified cover some things we’ve hit at 
Cambridge. Please do scan through and contribute anything they have not 
covered.

Best wishes,
Stig

[1] http://lists.openstack.org/pipermail/openstack-dev/2016-May/094658.html 