This is my suspend script; maybe your script is being called but for some reason isn't executing the loop body, so you don't see the messages in your power log.
#!/bin/bash
echo "`date` Suspend invoked $0 $*" >>/var/log/power_save.log
hosts=`scontrol show hostnames $1`
for host in $hosts
do
	echo sudo /share/system/bin/node_poweroff $host >>/var/log/power_save.log
	sudo /share/system/bin/node_poweroff $host >>/var/log/power_save.log
done

On Fri, 2014-08-29 at 02:36 -0700, Uwe Sauter wrote:
> Hi,
>
> thanks for the suggestion. Unfortunately I have already set
> SlurmctldDebug=9.
>
> A "grep -i power /var/log/slurm/slurmctld.log | tail" gives:
>
> [2014-08-29T09:10:05.202] Power save mode: 31 nodes
> [2014-08-29T09:12:17.228] power_save: waking nodes n510301
> [2014-08-29T09:15:56.267] power_save: waking nodes n510401
> [2014-08-29T09:20:18.321] Power save mode: 29 nodes
> [2014-08-29T09:23:23.359] power_save: waking nodes n511301
> [2014-08-29T09:31:05.448] Power save mode: 28 nodes
> [2014-08-29T09:41:45.535] Power save mode: 28 nodes
> [2014-08-29T09:49:25.619] power_save: suspending nodes n[511001,511101,511601]
> [2014-08-29T09:52:07.648] Power save mode: 31 nodes
> [2014-08-29T09:53:08.656] power_save: waking nodes n511001
>
> Taking nodes n[511001,511101,511601] as an example, "scontrol show node
> $NODE | grep State" gives:
>
> n511001: State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
> n511101: State=IDLE+POWER ThreadsPerCore=2 TmpDisk=0 Weight=1
> n511601: State=IDLE+POWER ThreadsPerCore=2 TmpDisk=0 Weight=1
>
> and "ipmitool -Ilanplus -UADMIN -Pxxxxx -H $NODE-bmc power status" gives:
>
> n511001: Chassis Power is on
> n511101: Chassis Power is on
> n511601: Chassis Power is on
>
> which indicates that SLURM tries to shut those nodes down but actually
> fails. This is consistent with my suspicion that one of the scripts
> doesn't get executed.
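If you want to verify on your side that slurmctld really invokes the SuspendProgram at all, one technique is to make the wrapper log everything it does, stderr included, before doing any real work. A minimal sketch of that technique (the /tmp log path is a placeholder, not one of the paths from this thread):

```shell
#!/bin/bash
# Sketch: record every invocation plus any errors the suspend logic produces.
# LOG is a placeholder path; a real wrapper would log to the site's log dir.
LOG=/tmp/power_debug.log

{
    echo "$(date +'%F %T') SuspendProgram invoked as $(id -un) with args: $*"
    # ... the real scontrol/sudo calls would go here; their stderr is captured too ...
} >>"$LOG" 2>&1
```

If the log stays empty, slurmctld never ran the script (check the SuspendProgram path and permissions); if it fills with sudo errors, the problem is in the sudo setup.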
> > Executing my script manually successfully shuts down the node:
> >
> > # sudo -u /opt/system/slurm/etc/node_poweroff.slurm n511601
> >
> > But after turning this node on again I get a status of
> > State=DOWN+POWER with Reason=Node unexpectedly rebooted
> > [slurm@2014-08-29T10:20:08]
> >
> > which seems odd, as SLURM should know that this node was without power
> > for some time.
> >
> > From this situation I have two issues:
> >
> > 1) How can I debug whether SLURM really executes the configured scripts?
> > 2) Should I file a bug report for this "unexpected reboot" behavior? The
> >    reboot was not unexpected, as SLURM wanted this node to shut down.
> >
> > Regards,
> >
> > 	Uwe
> >
> > On 28.08.2014 13:42, Franco Broi wrote:
> >
> > We use power saving so it definitely works; maybe you should try turning
> > on debugging for the controller daemon with scontrol and checking the
> > log file.
> >
> > On 28 Aug 2014 19:18, Uwe Sauter <[email protected]> wrote:
> >
> > Hi all,
> >
> > (configuration and scripts below text)
> >
> > I have configured SLURM to power down idle nodes, but it is probably
> > misconfigured. I aim for a configuration where idle nodes are powered
> > down after a certain period (say 10 min).
> >
> > As you can see from the configuration below, I have SLURM call either
> > "node_poweroff.slurm" or "node_poweron.slurm", wrapper scripts
> > that handle the conversion of SLURM's nodelist syntax and call
> > "node_poweroff" or "node_poweron" for each node.
> >
> > "node_power{off,on}" log their actions to /var/log/slurm/powermgmt.log
> > so I can follow, and in the future analyze, which nodes were turned off
> > and on.
> >
> > The current situation is that although I see 36 of 54 nodes in an
> > IDLE+POWER state, all nodes are powered on and accessible via SSH.
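On the "unexpected reboot" question above: SLURM marks a node DOWN when it registers after a boot the controller did not order, and since the suspend script never actually ran, the controller never saw a completed power-down, so the next boot looks unexpected. As a side note, and as a suggestion rather than something from this thread, the slurm.conf parameter ReturnToService controls whether such a DOWN node is returned to service automatically once it registers with a valid configuration:

```
# slurm.conf (sketch, not from the thread): let a DOWN node that registers
# with a valid configuration return to service without manual intervention
ReturnToService=1
```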
> >
> > Output from "grep -i power /var/log/slurm/slurmctld.log | tail":
> >
> > [2014-08-28T12:01:24.975] Power save mode: 30 nodes
> > [2014-08-28T12:11:44.080] Power save mode: 30 nodes
> > [2014-08-28T12:22:44.194] Power save mode: 30 nodes
> > [2014-08-28T12:33:44.306] Power save mode: 30 nodes
> > [2014-08-28T12:44:01.425] Power save mode: 30 nodes
> > [2014-08-28T12:51:44.514] power_save: suspending nodes n[510301,510601,511901]
> > [2014-08-28T12:54:26.547] Power save mode: 33 nodes
> > [2014-08-28T12:54:26.547] power_save: suspending nodes n[511101,512501]
> > [2014-08-28T12:57:08.581] power_save: suspending nodes n510901
> > [2014-08-28T13:05:10.666] Power save mode: 36 nodes
> >
> > Output from "tail /var/log/slurm/powermgmt.log":
> >
> > 2014-08-27 16:39:36 power on n512501
> > 2014-08-27 16:51:17 power on n512601
> > 2014-08-27 17:59:38 power on n512601
> > 2014-08-28 09:05:54 power on n511101
> > 2014-08-28 09:06:05 power on n511201
> > 2014-08-28 09:06:11 power on n512001
> > 2014-08-28 09:06:19 power on n512201
> > 2014-08-28 10:41:51 power on n510501
> > 2014-08-28 10:41:51 power on n510701
> > 2014-08-28 11:31:41 power on n511101
> >
> > grep does not find "down" in /var/log/slurm/powermgmt.log, which it
> > should if "node_poweroff" had been executed.
> >
> > My impression is that something (misconfiguration? bad sudo
> > configuration? missing permissions?) doesn't allow SLURM to execute one
> > of the mentioned scripts.
> >
> > Can someone check my configuration and give some advice on how to debug
> > this issue further?
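Since the scripts work when run by hand, it is worth reproducing the way slurmctld would run them: as the slurm user (assuming that is your SlurmUser) and with a scrubbed environment, e.g. `sudo -u slurm env -i /opt/system/slurm/etc/node_poweroff.slurm n511101`. A small sketch of how bare that `env -i` environment is:

```shell
#!/bin/bash
# Sketch: env -i strips the inherited environment, mimicking the minimal
# environment a daemon passes to child processes. Anything the wrapper picks
# up from an interactive shell (extra PATH entries, SSH agent, etc.) is gone.
VARS=$(env -i /bin/bash -c 'env | wc -l' | tr -d '[:space:]')
echo "exported variables left after env -i: ${VARS}"
```

If the wrapper fails under this kind of invocation but works from your shell, the difference is almost certainly an environment variable or a sudo restriction.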
> >
> > Thank you,
> >
> > 	Uwe
> >
> >
> > ### slurm.conf excerpt ###
> >
> > # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> > SuspendTime=600
> > SuspendRate=30
> > ResumeRate=10
> > SuspendProgram=/opt/system/slurm/etc/node_poweroff.slurm
> > ResumeProgram=/opt/system/slurm/etc/node_poweron.slurm
> > SuspendTimeout=120
> > ResumeTimeout=300
> > #SuspendExcNodes=n51[03,04,29,30][01],n52[04,05][01]
> > #SuspendExcParts=
> > BatchStartTimeout=60
> >
> > ##########################
> >
> > ### /opt/system/slurm/etc/node_poweroff.slurm ###
> >
> > #!/bin/bash
> > set -o nounset
> >
> > NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames $1)
> >
> > for NODE in ${NODES}; do
> >     sudo /opt/system/slurm/etc/node_poweroff ${NODE}
> > done
> >
> > exit 0
> >
> > #################################################
> >
> > ### /opt/system/slurm/etc/node_poweron.slurm ###
> >
> > #!/bin/bash
> > set -o nounset
> >
> > NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames $1)
> >
> > for NODE in ${NODES}; do
> >     /opt/system/slurm/etc/node_poweron ${NODE}
> > done
> >
> > #################################################
> >
> > ### /opt/system/slurm/etc/node_poweroff ###
> >
> > #!/bin/bash
> > set -o nounset
> >
> > NODE=$1
> >
> > echo "$(date +'%F %T') power down ${NODE}" >> /var/log/slurm/powermgmt.log
> >
> > ssh ${NODE} "/etc/init.d/lustre_client stop"
> > ssh ${NODE} "umount /localscratch /nfs/*"
> > ssh ${NODE} "service slurm stop"
> > ssh ${NODE} "service munge stop"
> > ssh ${NODE} "poweroff"
> >
> > sleep 10
> >
> > ping -c1 ${NODE} >/dev/null 2>&1
> > [ $? -eq 0 ] && /usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power off
> >
> > exit 0
> >
> > #############################################
> >
> > ### /opt/system/slurm/etc/node_poweron ###
> >
> > #!/bin/bash
> > set -o nounset
> >
> > NODE=${1}
> >
> > echo "$(date +'%F %T') power on ${NODE}" >> /var/log/slurm/powermgmt.log
> >
> > /usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power on
> >
> > exit 0
> >
> > ##########################################
> >
> > ### /etc/sudoers excerpt ###
> >
> > slurm ALL=NOPASSWD: /opt/system/slurm/etc/node_poweron
> > slurm ALL=NOPASSWD: /opt/system/slurm/etc/node_poweroff
> >
> > ############################
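One more sudo-related possibility worth checking, since the sudoers entries themselves look fine: some distributions ship /etc/sudoers with "Defaults requiretty", which makes sudo refuse to run without a terminal, and running without a terminal is exactly how slurmctld invokes the SuspendProgram. A sketch of the check (the "Defaults:slurm !requiretty" exemption mentioned in the comment is a suggestion, not something from this thread):

```shell
#!/bin/bash
# Sketch: detect the classic daemon-vs-sudo blocker. If "Defaults requiretty"
# is active, sudo fails whenever it runs without a tty. A per-user exemption
# would look like:  Defaults:slurm !requiretty
if grep -Eqs '^[[:space:]]*Defaults[[:space:]]+requiretty' /etc/sudoers; then
    MSG="requiretty is set: sudo from slurmctld will fail without an exemption"
else
    MSG="requiretty not found (or /etc/sudoers not readable as this user)"
fi
echo "$MSG"
```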
