We use power saving, so it definitely works. Try turning on debugging for the
controller daemon with scontrol and checking the log file.
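
A rough sketch of what that could look like. The function name is made up for illustration, the default log path is taken from the original post below, and debug2 is only a guess at a useful verbosity; adjust all three to your site.

```shell
#!/bin/bash
# Hypothetical helper; the name and the default log path are assumptions.
debug_power_save() {
  local log=${1:-/var/log/slurm/slurmctld.log}

  # Raise slurmctld's log level at runtime (no restart needed).
  scontrol setdebug debug2

  # Show which suspend/resume settings the running daemon actually uses.
  scontrol show config | grep -i -E 'suspend|resume'

  # Look at recent power save messages.
  grep -i power "${log}" | tail -n 20

  # Set the level back to info (assumed here to be the usual level).
  scontrol setdebug info
}
```

Call debug_power_save on the controller after a node has been idle for longer than SuspendTime, so that there is a suspend attempt to look at.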

On 28 Aug 2014 19:18, Uwe Sauter <[email protected]> wrote:

Hi all,

(configuration and scripts below text)

I have configured SLURM to power down idle nodes but it probably is
misconfigured. I aim for a configuration where after a certain period
(say 10min) idle nodes are powered down.

As you can see from the configuration below I have SLURM call either
"node_poweroff.slurm" or "node_poweron.slurm" which are wrapper scripts
that handle the conversion of SLURM's nodelist syntax and call
"node_poweroff" or "node_poweron" for each node.
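
To illustrate the conversion the wrappers rely on: "scontrol show hostnames" turns a nodelist expression into one hostname per line. The pure-bash function below is only a hypothetical stand-in that mimics this for the simple "prefix[a,b,c]" form seen in the log output; it does not handle ranges, and the real scripts should keep using scontrol.

```shell
#!/bin/bash
set -o nounset

# Hypothetical helper: expands only the simple "prefix[a,b,c]" form.
# Ranges such as n[01-04] are NOT handled; scontrol does that properly.
expand_simple_hostlist() {
  local list=$1
  local re='^([^][]*)\[([^][]*)\]$'
  if [[ $list =~ $re ]]; then
    local prefix=${BASH_REMATCH[1]}
    local -a ids
    IFS=',' read -r -a ids <<< "${BASH_REMATCH[2]}"
    local id
    for id in "${ids[@]}"; do
      printf '%s%s\n' "$prefix" "$id"
    done
  else
    # No brackets: treat the argument as a single hostname.
    printf '%s\n' "$list"
  fi
}

expand_simple_hostlist 'n[511101,512501]'
```

Note that the argument is quoted in the call. An unquoted expression containing brackets is subject to shell filename expansion.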

"node_power{off,on}" log their actions into /var/log/slurm/powermgmt.log
so I can follow and in the future analyze which nodes were turned off
and on.

The current situation is that although I see 36 out of 54 nodes in an
IDLE+POWER state, all nodes are powered on and accessible via SSH.

Output from "grep -i power /var/log/slurm/slurmctld.log | tail"

[2014-08-28T12:01:24.975] Power save mode: 30 nodes
[2014-08-28T12:11:44.080] Power save mode: 30 nodes
[2014-08-28T12:22:44.194] Power save mode: 30 nodes
[2014-08-28T12:33:44.306] Power save mode: 30 nodes
[2014-08-28T12:44:01.425] Power save mode: 30 nodes
[2014-08-28T12:51:44.514] power_save: suspending nodes n[510301,510601,511901]
[2014-08-28T12:54:26.547] Power save mode: 33 nodes
[2014-08-28T12:54:26.547] power_save: suspending nodes n[511101,512501]
[2014-08-28T12:57:08.581] power_save: suspending nodes n510901
[2014-08-28T13:05:10.666] Power save mode: 36 nodes

Output from "tail /var/log/slurm/powermgmt.log"

2014-08-27 16:39:36 power on   n512501
2014-08-27 16:51:17 power on   n512601
2014-08-27 17:59:38 power on   n512601
2014-08-28 09:05:54 power on   n511101
2014-08-28 09:06:05 power on   n511201
2014-08-28 09:06:11 power on   n512001
2014-08-28 09:06:19 power on   n512201
2014-08-28 10:41:51 power on   n510501
2014-08-28 10:41:51 power on   n510701
2014-08-28 11:31:41 power on   n511101

grep does not find "down" in /var/log/slurm/powermgmt.log which it
should if "node_poweroff" has been executed.

My impression is that something (misconfiguration? bad sudo
configuration? some other permissions issue?) doesn't allow SLURM to
execute one of the mentioned scripts.
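
One way to test the sudo part without powering anything off might be the sketch below. The function name is hypothetical, and it assumes it is run as root on the controller and that the SLURM daemon runs as user "slurm", as the sudoers excerpt further down suggests.

```shell
#!/bin/bash
# Hypothetical helper, meant to be run as root on the controller.
check_slurm_sudo() {
  local cmd=${1:-/opt/system/slurm/etc/node_poweroff}

  # -n: never prompt for a password; -l CMD: only report whether CMD
  # is permitted. This does not run CMD, so no node is powered off.
  if sudo -u slurm sudo -n -l "${cmd}"; then
    echo "slurm may run ${cmd} via sudo"
  else
    echo "sudo refuses ${cmd} for user slurm" >&2
    return 1
  fi
}
```

A check from an interactive shell has a terminal attached, while the controller daemon calls the script without one. If sudoers contains "Defaults requiretty", the check could pass and the real call could still fail; whether that applies here is only a guess.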

Can someone check my configuration and give some advice on how to debug
this issue further?


Thank you,

        Uwe


### slurm.conf excerpt ###

# POWER SAVE SUPPORT FOR IDLE NODES (optional)
SuspendTime=600
SuspendRate=30
ResumeRate=10
SuspendProgram=/opt/system/slurm/etc/node_poweroff.slurm
ResumeProgram=/opt/system/slurm/etc/node_poweron.slurm
SuspendTimeout=120
ResumeTimeout=300
#SuspendExcNodes=n51[03,04,29,30][01],n52[04,05][01]
#SuspendExcParts=
BatchStartTimeout=60

##########################

### /opt/system/slurm/etc/node_poweroff.slurm ###

#!/bin/bash
set -o nounset

NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames "$1")

for NODE in ${NODES}; do
  sudo /opt/system/slurm/etc/node_poweroff ${NODE}
done

exit 0

#################################################
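
The wrapper above writes nothing itself; the only log line comes from node_poweroff, which is reached through sudo, so a failing sudo call leaves no trace in powermgmt.log. A minimal sketch of logging in the wrapper as well, assuming the slurm user may append to the log file (the helper name is made up):

```shell
#!/bin/bash
set -o nounset

# Hypothetical helper: append a timestamped line to the given log file.
log_power_event() {
  local logfile=$1
  shift
  echo "$(date +'%F %T') $*" >> "${logfile}"
}

# Possible use inside node_poweroff.slurm, before the loop:
#   log_power_event /var/log/slurm/powermgmt.log "suspend requested for $1"
# and inside the loop, directly after the sudo call:
#   log_power_event /var/log/slurm/powermgmt.log "sudo exit status $? for ${NODE}"
```

With this in place, a "suspend requested" line without a matching "power down" line would point at the sudo step rather than at SLURM not calling the script.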

### /opt/system/slurm/etc/node_poweron.slurm ###

#!/bin/bash
set -o nounset

NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames "$1")

for NODE in ${NODES}; do
  /opt/system/slurm/etc/node_poweron ${NODE}
done

#################################################

### /opt/system/slurm/etc/node_poweroff ###

#!/bin/bash
set -o nounset

NODE=$1

echo "$(date +'%F %T') power down ${NODE}" >> /var/log/slurm/powermgmt.log

ssh ${NODE} "/etc/init.d/lustre_client stop"
ssh ${NODE} "umount /localscratch /nfs/*"
ssh ${NODE} "service slurm stop"
ssh ${NODE} "service munge stop"
ssh ${NODE} "poweroff"

sleep 10

# If the node still answers ping after the soft poweroff, force it off via IPMI.
if ping -c1 ${NODE} >/dev/null 2>&1; then
  /usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power off
fi

exit 0

#############################################

### /opt/system/slurm/etc/node_poweron ###

#!/bin/bash
set -o nounset

NODE=${1}

echo "$(date +'%F %T') power on   ${NODE}" >> /var/log/slurm/powermgmt.log

/usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power on

exit 0


##########################################

### /etc/sudoers excerpt ###

slurm           ALL=NOPASSWD: /opt/system/slurm/etc/node_poweron
slurm           ALL=NOPASSWD: /opt/system/slurm/etc/node_poweroff

############################
