Hello,

I'm still struggling with power saving, and it would be really useful to be 
able to use it reliably on our test cluster. I've made some progress, but I'm 
still baffled.

First of all, I assumed that "SuspendExcNodes=red0[001-910]" was an exclusion 
policy. It seems not to work as expected. Slurm is trying to power down my idle 
red nodes...

Thu Mar 16 09:11:28 GMT 2017 Suspend invoked /local/software/slurm/default/etc/node_poweroff.slurm red[0318-0343]
red0318 red0319 red0320 red0321 red0322 red0323 red0324 red0325 red0326 red0327 red0328 red0329 red0330 red0331 red0332 red0333 red0334 red0335 red0336 red0337 red0338 red0339 red0340 red0341 red0342 red0343
2017-03-16 09:11:28 power down red0318
2017-03-16 09:11:55 power down red0319
2017-03-16 09:12:22 power down red0320
2017-03-16 09:12:49 power down red0321
2017-03-16 09:13:16 power down red0322
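
To rule out a simple mismatch between the exclusion pattern and the node names, 
a check along the following lines may help. It is only a sketch: the helper 
name in_simple_range is hypothetical, and it understands just the single 
prefix[lo-hi] form used in SuspendExcNodes above, not the full Slurm hostlist 
syntax.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): succeed if NODE falls inside a
# simple PREFIX[LO-HI] pattern such as red0[001-910].
in_simple_range() {
  local pattern=$1 node=$2
  local re='^([^[]*)\[([0-9]+)-([0-9]+)\]$'
  # Return 2 if the pattern is not of the simple form this sketch handles.
  [[ $pattern =~ $re ]] || return 2
  local prefix=${BASH_REMATCH[1]} lo=${BASH_REMATCH[2]} hi=${BASH_REMATCH[3]}
  # The node must start with the prefix ...
  [[ $node == "$prefix"* ]] || return 1
  local suffix=${node#"$prefix"}
  # ... and end in a number with the same zero-padded width as the range.
  [[ $suffix =~ ^[0-9]+$ && ${#suffix} -eq ${#lo} ]] || return 1
  # 10# forces base 10 so that leading zeros are not read as octal.
  (( 10#$suffix >= 10#$lo && 10#$suffix <= 10#$hi ))
}
```

For example, "in_simple_range 'red0[001-910]' red0318 && echo excluded" shows 
whether a node that Slurm tried to suspend is covered by the pattern.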

The other issue is that, for some reason, my second script isn't being 
executed. I have edited the sudoers file (using visudo) as follows, to no 
avail. User hpc is the slurm user.

Defaults:hpc !requiretty
hpc ALL=(ALL)       NOPASSWD: ALL

How can I reliably debug this part of the process?
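
One possible approach is to record the exit status and stderr of every command 
the suspend wrapper runs, so that a failing sudo call leaves a trace. A minimal 
sketch, assuming a hypothetical helper named run_logged and a hypothetical log 
path held in POWER_LOG (neither is part of Slurm):

```shell
#!/bin/bash
# Hypothetical debug log location; override by setting POWER_LOG beforehand.
POWER_LOG=${POWER_LOG:-/tmp/powermgmt-debug.log}

# Hypothetical helper: run a command, append its stdout/stderr and exit
# status to POWER_LOG, and pass the exit status back to the caller.
run_logged() {
  local rc
  echo "$(date +'%F %T') running: $*" >>"$POWER_LOG"
  "$@" >>"$POWER_LOG" 2>&1
  rc=$?
  echo "$(date +'%F %T') exit status ${rc}: $*" >>"$POWER_LOG"
  return "$rc"
}
```

In the wrapper this could be used as 
run_logged sudo -n /local/software/slurm/default/etc/node_poweroff "${NODE}" 
where "sudo -n" makes sudo fail with an error rather than wait for a password 
prompt that nobody will ever see.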

Does power saving actually work reliably? My experiences would seem to suggest 
otherwise. I would appreciate any advice that anyone could give, please.

Best regards,
David

From: Baker D.J.
Sent: Wednesday, March 15, 2017 4:35 PM
To: slurm-dev <[email protected]>
Subject: Struggling with power saving

Hello,

I'm struggling with slurm power saving at the moment. I have modified my 
slurm.conf and written some poweron/off scripts. In addition, I have given the 
slurm user appropriate root powers using visudo and even remembered to set the 
slurm user not to require a TTY by default in /etc/sudoers. Furthermore, I have 
restarted the slurm daemon on the master, and restarted slurm on the node that 
I'm experimenting with (purple015). All other nodes in the cluster are excluded.

It looks like power saving is enabled. I see the following messages in the log:
[2017-03-15T15:55:10.091] Power save mode: 1 nodes
[2017-03-15T16:05:11.068] Power save mode: 1 nodes
[2017-03-15T16:15:12.188] Power save mode: 1 nodes
[2017-03-15T16:25:13.850] Power save mode: 1 nodes

On the other hand, despite the fact that purple015 is always idle, it is never 
shut down. Have I misconfigured power saving? Extracts from my slurm.conf and 
scripts are shown below. The only thing that is bothering me is that slurm.conf 
sets "SelectType=select/cons_res". I know that the documentation 
recommends "SelectType=select/linear"; however, how does that affect 
settings like "SelectTypeParameters=CR_CPU"?

Any help would be appreciated, please.

Best regards,
David

Extract from slurm.conf

SuspendTime=600
SuspendRate=30
ResumeRate=10
SuspendProgram=/local/software/slurm/default/etc/node_poweroff.slurm
ResumeProgram=/local/software/slurm/default/etc/node_poweron.slurm
SuspendTimeout=120
ResumeTimeout=300
SuspendExcNodes=red0[001-910]
#SuspendExcParts=
BatchStartTimeout=60

SelectType=select/cons_res

Extract from /local/software/slurm/default/etc/node_poweroff.slurm

#!/bin/bash
set -o nounset

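echo "$(date) Suspend invoked $0 $*" >>/var/log/slurm/powermgmt.log
# Expand the hostlist expression (e.g. red[0318-0343]) into individual names.
NODES=$(/local/software/slurm/default/bin/scontrol show hostnames "$1")
# Left unquoted on purpose so the names are logged on a single line.
echo $NODES >>/var/log/slurm/powermgmt.log

for NODE in ${NODES}; do
  sudo /local/software/slurm/default/etc/node_poweroff "${NODE}"
done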

exit 0

Extract from /local/software/slurm/default/etc/node_poweroff

#!/bin/bash
set -o nounset

NODE=${1}
#NODE=purple015

echo "$(date +'%F %T') power down ${NODE}" >> /var/log/slurm/powermgmt.log

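# Ask the node to shut itself down cleanly.
ssh "${NODE}" "shutdown -h now"

sleep 20

# If the node still answers ping, force it off with rpower.
if ping -c1 "${NODE}" >/dev/null 2>&1; then
  rpower "${NODE}" off
fi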

exit 0

