Hello,

I'm still struggling with power saving, and it would be really useful to be able to use it reliably on our test cluster. I've made some progress; however, I'm still baffled.

First of all, I assumed that "SuspendExcNodes=red0[001-910]" was an exclusion policy. It does not seem to work as expected: Slurm is trying to power down my idle red nodes...

    Thu Mar 16 09:11:28 GMT 2017 Suspend invoked /local/software/slurm/default/etc/node_poweroff.slurm red[0318-0343]
    red0318 red0319 red0320 red0321 red0322 red0323 red0324 red0325 red0326 red0327 red0328 red0329 red0330 red0331 red0332 red0333 red0334 red0335 red0336 red0337 red0338 red0339 red0340 red0341 red0342 red0343
    2017-03-16 09:11:28 power down red0318
    2017-03-16 09:11:55 power down red0319
    2017-03-16 09:12:22 power down red0320
    2017-03-16 09:12:49 power down red0321
    2017-03-16 09:13:16 power down red0322

The other issue is that, for some reason, my second script isn't being executed. I have modified sudoers (using visudo) as follows, to no avail. User hpc is the slurm user.

    Defaults:hpc !requiretty
    hpc ALL=(ALL) NOPASSWD: ALL

How can I reliably debug this part of the process? Does power saving actually work reliably? My experiences would seem to suggest otherwise. I would appreciate any advice that anyone could give, please.

Best regards,
David

From: Baker D.J.
Sent: Wednesday, March 15, 2017 4:35 PM
To: slurm-dev <[email protected]>
Subject: Struggling with power saving

Hello,

I'm struggling with slurm power saving at the moment. I have modified my slurm.conf and written some poweron/off scripts. In addition, I have given the slurm user appropriate root powers using visudo and even remembered to set the slurm user not to require a TTY by default in /etc/sudoers. Furthermore, I have restarted the slurm daemon on the master, and restarted slurm on the node that I'm experimenting with (purple015). All other nodes in the cluster are excluded. It looks like power saving is enabled.
I see the following messages in the log:

    [2017-03-15T15:55:10.091] Power save mode: 1 nodes
    [2017-03-15T16:05:11.068] Power save mode: 1 nodes
    [2017-03-15T16:15:12.188] Power save mode: 1 nodes
    [2017-03-15T16:25:13.850] Power save mode: 1 nodes

On the other hand, despite the fact that purple015 is always idle, it is never shut down. Have I misconfigured power saving? Extracts from some of my scripts are shown below. The only thing that is bothering me is "SelectType=select/cons_res" in slurm.conf. I know that the documentation recommends "SelectType=select/linear"; however, how does that affect settings like "SelectTypeParameters=CR_CPU"? Any help would be appreciated, please.

Best regards,
David

Extract from slurm.conf

    SuspendTime=600
    SuspendRate=30
    ResumeRate=10
    SuspendProgram=/local/software/slurm/default/etc/node_poweroff.slurm
    ResumeProgram=/local/software/slurm/default/etc/node_poweron.slurm
    SuspendTimeout=120
    ResumeTimeout=300
    SuspendExcNodes=red0[001-910]
    #SuspendExcParts=
    BatchStartTimeout=60
    SelectType=select/cons_res

Extract from /local/software/slurm/default/etc/node_poweroff.slurm

    #!/bin/bash
    set -o nounset
    echo "`date` Suspend invoked $0 $*" >>/var/log/slurm/powermgmt.log
    NODES=$(/local/software/slurm/default/bin/scontrol show hostnames $1)
    echo $NODES >>/var/log/slurm/powermgmt.log
    for NODE in ${NODES}; do
        sudo /local/software/slurm/default/etc/node_poweroff ${NODE}
    done
    exit 0

Extract from /local/software/slurm/default/etc/node_poweroff

    #!/bin/bash
    set -o nounset
    NODE=${1}
    #NODE=purple015
    echo "$(date +'%F %T') power down ${NODE}" >> /var/log/slurm/powermgmt.log
    ssh ${NODE} "shutdown -h now"
    sleep 20
    ping -c1 ${NODE} >/dev/null 2>&1
    [ $? -eq 0 ] && rpower ${NODE} off
    exit 0
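
One possible way to make a silent failure in the wrapper loop visible is to record the output and exit status of each per-node command. The following is only a minimal sketch, not the scripts above: the function name power_off_nodes, the LOG and POWEROFF_CMD variables, and the /tmp log path are all hypothetical names introduced for illustration.

```shell
#!/bin/bash
set -o nounset

# Hypothetical defaults; neither variable exists in the original scripts.
LOG=${LOG:-/tmp/powermgmt-debug.log}
POWEROFF_CMD=${POWEROFF_CMD:-"echo would-power-off"}

# Run the power-off command once per node and append its combined
# stdout/stderr and its exit status to the log file.
power_off_nodes() {
    local node output status
    for node in "$@"; do
        # POWEROFF_CMD is deliberately unquoted so that it can carry arguments.
        output=$(${POWEROFF_CMD} "${node}" 2>&1)
        status=$?
        echo "$(date +'%F %T') ${node} exit=${status} ${output}" >>"${LOG}"
    done
}
```

In a real wrapper, POWEROFF_CMD might be set to something like "sudo -n /local/software/slurm/default/etc/node_poweroff"; with -n, sudo fails with an error rather than prompting for a password, so a sudoers problem would appear in the log as a non-zero exit status.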
