I'm going to start working on VCL-169 Event Driven power down...
This is a first step of a larger power management feature.
In this step, I suggest extending the health_check.pl script to accept
options for different data center events that would require the hardware
to be shut down. These events are usually related to heat issues detected
by the blade chassis' own sensors or by other external thermal sensors.
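
To make the idea of per-event options concrete, here is a rough sketch of
what the command-line interface could look like. It is written in Python
only for illustration (health_check.pl itself is Perl), and the option
names --event and --grace-minutes, the event names, and the 15-minute
default are all placeholders I made up, not existing options.

```python
import argparse

# Hypothetical event names, one per phase described in this proposal.
EVENTS = ("shutdown-idle", "shutdown-inuse")


def parse_args(argv):
    """Parse the proposed event options; with no --event, run a normal health check."""
    parser = argparse.ArgumentParser(prog="health_check")
    parser.add_argument(
        "--event",
        choices=EVENTS,
        default=None,
        help="data center event to respond to (placeholder names)",
    )
    parser.add_argument(
        "--grace-minutes",
        type=int,
        default=15,  # assumed default, not a decided value
        help="countdown before in-use blades are shut down",
    )
    return parser.parse_args(argv)
```

For example, parse_args(["--event", "shutdown-idle"]) would select phase 1,
while an empty argument list leaves event unset so the script behaves as it
does today.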
The two primary events are:
1) Shut down idle blades (phase 1)
I'm thinking the process is to find all idle blades under the
controlling management node, relocate any upcoming reservations assigned
to those blades, and then shut the blades down.
2) Shut down blades currently in use (phase 2, if phase 1 did not do enough)
This second phase would be triggered if and only if event 1 is not
effective. It notifies the user running on the VCL resource about the
unexpected data center problem and then starts a countdown to when the
node will be shut down. Depending on the reservation type (long-term vs.
short, or some other distinction), we'll need to either reclaim the
blade or just shut it down and retain the reservation data by
extending the end time. Once things are back to normal, vcld will detect
these retained reservations on startup, start them back up, and
notify the end user that the resource is available again.
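
The two phases above could be planned roughly as in the sketch below. This
is only an illustration in Python, not VCL code: the Blade and Reservation
classes, the state names, and the function names are hypothetical. It also
assumes that long-term reservations are the ones retained (end time
extended by the expected outage) and short ones are reclaimed, which is
one possible policy and still open for discussion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Blade:
    name: str
    state: str  # hypothetical state names: "idle" or "inuse"


@dataclass
class Reservation:
    user: str
    blade: str
    start: datetime
    end: datetime
    long_term: bool = False


def plan_phase1(blades, reservations, now):
    """Phase 1: pick idle blades and the upcoming reservations to relocate."""
    idle = [b.name for b in blades if b.state == "idle"]
    relocate = [r for r in reservations if r.blade in idle and r.start > now]
    return idle, relocate


def plan_phase2(reservations, now, grace, expected_outage):
    """Phase 2: for each active reservation, decide the action after the countdown."""
    shutdown_at = now + grace  # end of the countdown shown to the user
    plan = []
    for r in reservations:
        if not (r.start <= now < r.end):
            continue  # only reservations currently in use are affected
        if r.long_term:
            # Assumed policy: keep the reservation and push its end time out.
            action, new_end = "retain", r.end + expected_outage
        else:
            # Assumed policy: reclaim the blade once the countdown expires.
            action, new_end = "reclaim", min(r.end, shutdown_at)
        plan.append({
            "user": r.user,  # user to notify about the shutdown
            "blade": r.blade,
            "action": action,
            "shutdown_at": shutdown_at,
            "new_end": new_end,
        })
    return plan
```

On startup after the event, vcld would then only need to look for the
reservations marked as retained to bring them back up and notify their
users.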
If there are any thoughts or other suggestions, please feel free to
comment.
Aaron