We had power saving enabled when we installed our new system (Slurm 15.08).
  The config was mostly done by the install team, but the poweroff/on
scripts just power nodes off/on via IPMI and log the action.  While it was
enabled, it did seem to work properly.

We ended up turning power save off, as we were getting less than
satisfactory scheduling decisions - we use TopologyPlugin=topology/tree to
colocate jobs on as few switches as possible.  Powered-off nodes wouldn't
be considered, so jobs would be scattered across multiple switches rather
than Slurm turning on a few nodes on the same switch.
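
For context, topology/tree is driven by a topology.conf that describes which
nodes hang off which leaf switches.  A minimal sketch looks like the
following (the switch and node names here are purely illustrative, not our
real layout):

# slurm.conf
TopologyPlugin=topology/tree

# topology.conf - hypothetical two-leaf layout
SwitchName=leaf01 Nodes=node[001-018]
SwitchName=leaf02 Nodes=node[019-036]
SwitchName=spine  Switches=leaf[01-02]

With that in place the scheduler tries to keep each job on as few leaf
switches as it can, which is exactly the behaviour that powered-off nodes
were defeating for us.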


-- 
*Nathan Harper* // IT Systems Lead

*e: * nathan.har...@cfms.org.uk // *t: * 0117 906 1104 // *m: * 07875 510891 //
*w: * www.cfms.org.uk <http://www.cfms.org.uk> //
CFMS Services Ltd // Bristol & Bath Science Park // Dirac Crescent // Emersons
Green // Bristol // BS16 7FR

CFMS Services Ltd is registered in England and Wales No 05742022 - a
subsidiary of CFMS Ltd
CFMS Services Ltd registered office // Victoria House // 51 Victoria Street
// Bristol // BS1 6AD

On 16 March 2017 at 09:45, Baker D.J. <d.j.ba...@soton.ac.uk> wrote:

> Hello,
>
>
>
> I’m still struggling with power saving, and it would be really useful to
> be able to use it reliably on our test cluster. I’ve made some progress,
> but I’m still baffled.
>
>
>
> First of all, I assumed that “SuspendExcNodes=red0[001-910]” would exclude
> those nodes from power saving, but it does not seem to work as expected:
> Slurm is still trying to power down my idle red nodes…
>
>
>
> Thu Mar 16 09:11:28 GMT 2017 Suspend invoked /local/software/slurm/default/etc/node_poweroff.slurm red[0318-0343]
>
> red0318 red0319 red0320 red0321 red0322 red0323 red0324 red0325 red0326
> red0327 red0328 red0329 red0330 red0331 red0332 red0333 red0334 red0335
> red0336 red0337 red0338 red0339 red0340 red0341 red0342 red0343
>
> 2017-03-16 09:11:28 power down red0318
>
> 2017-03-16 09:11:55 power down red0319
>
> 2017-03-16 09:12:22 power down red0320
>
> 2017-03-16 09:12:49 power down red0321
>
> 2017-03-16 09:13:16 power down red0322
>
>
>
> The other issue is that, for some reason, my second script isn’t being
> executed. I have modified the sudoers file via visudo as follows, to no
> avail. User hpc is the slurm user:
>
>
>
> Defaults:hpc !requiretty
>
> hpc ALL=(ALL)       NOPASSWD: ALL
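>
> (For what it’s worth, I could presumably tighten that to just the poweroff
> helper rather than “NOPASSWD: ALL”; something like the following, although
> the broad rule above is what is actually in place at the moment:)
>
> hpc ALL=(ALL) NOPASSWD: /local/software/slurm/default/etc/node_poweroff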
>
>
>
> How can I reliably debug this part of the process?
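>
> (One idea I am toying with is making the scripts log everything themselves,
> since any output from SuspendProgram is easy to lose. Something like this
> at the top of node_poweroff.slurm, assuming /var/log/slurm is writable by
> the hpc user:)
>
> #!/bin/bash
> set -o nounset
>
> # Capture all stdout/stderr from this script, including the sudo call,
> # so that I can see whether the second script is ever reached.
> exec >> /var/log/slurm/powermgmt.log 2>&1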
>
>
>
> Does power saving actually work reliably? My experiences would seem to
> suggest otherwise. I would appreciate any advice that anyone could give,
> please.
>
>
>
> Best regards,
>
> David
>
>
>
> *From:* Baker D.J.
> *Sent:* Wednesday, March 15, 2017 4:35 PM
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Subject:* Struggling with power saving
>
>
>
> Hello,
>
>
>
> I’m struggling with Slurm power saving at the moment. I have modified my
> slurm.conf and written some poweron/off scripts. In addition I have given
> the slurm user appropriate root powers using visudo and even remembered to
> set the slurm user not to require a TTY by default in /etc/sudoers.
> Furthermore I have restarted the slurm daemon on the master, and restarted
> slurm on the node that I’m experimenting with (purple015). All other nodes
> in the cluster are excluded.
>
>
>
> It looks like power saving is enabled. I see the following messages in the
> log
>
> [2017-03-15T15:55:10.091] Power save mode: 1 nodes
>
> [2017-03-15T16:05:11.068] Power save mode: 1 nodes
>
> [2017-03-15T16:15:12.188] Power save mode: 1 nodes
>
> [2017-03-15T16:25:13.850] Power save mode: 1 nodes
>
>
>
> On the other hand, despite the fact that purple015 is always idle, it is
> never shut down. Have I misconfigured power saving? Extracts from my
> configuration and scripts are shown below. The only thing that is bothering
> me is the “SelectType=select/cons_res” setting in slurm.conf. I know that
> the documentation recommends “SelectType=select/linear”, however how does
> that affect settings like “SelectTypeParameters=CR_CPU”?
>
>
>
> Any help would be appreciated, please.
>
>
>
> Best regards,
>
> David
>
>
>
> Extract from slurm.conf
>
>
>
> SuspendTime=600
>
> SuspendRate=30
>
> ResumeRate=10
>
> SuspendProgram=/local/software/slurm/default/etc/node_poweroff.slurm
>
> ResumeProgram=/local/software/slurm/default/etc/node_poweron.slurm
>
> SuspendTimeout=120
>
> ResumeTimeout=300
>
> SuspendExcNodes=red0[001-910]
>
> #SuspendExcParts=
>
> BatchStartTimeout=60
>
>
>
> SelectType=select/cons_res
>
>
>
> Extract from /local/software/slurm/default/etc/node_poweroff.slurm
>
>
>
> #!/bin/bash
>
> set -o nounset
>
>
>
> echo "`date` Suspend invoked $0 $*" >>/var/log/slurm/powermgmt.log
>
> NODES=$(/local/software/slurm/default/bin/scontrol show hostnames $1)
>
> echo $NODES >>/var/log/slurm/powermgmt.log
>
>
>
> for NODE in ${NODES}; do
>
>   sudo /local/software/slurm/default/etc/node_poweroff ${NODE}
>
> done
>
>
>
> exit 0
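>
> (If it would help with debugging, the loop above could also record whether
> each sudo call succeeds; a variant along these lines, logging the helper’s
> exit status, is what I have in mind, though it is not in place yet:)
>
> for NODE in ${NODES}; do
>   # Run the helper and note its exit status so failures show up in the log
>   sudo /local/software/slurm/default/etc/node_poweroff ${NODE}
>   echo "`date` node_poweroff ${NODE} exited with status $?" >>/var/log/slurm/powermgmt.log
> done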
>
>
>
> Extract from /local/software/slurm/default/etc/node_poweroff
>
>
>
> #!/bin/bash
>
> set -o nounset
>
>
>
> NODE=${1}
>
> #NODE=purple015
>
>
>
> echo "$(date +'%F %T') power down ${NODE}" >> /var/log/slurm/powermgmt.log
>
>
>
> ssh ${NODE} "shutdown -h now"
>
>
>
> sleep 20
>
>
>
> ping -c1 ${NODE} >/dev/null 2>&1
>
> [ $? -eq 0 ] && rpower ${NODE} off
>
>
>
> exit 0
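>
> (I haven’t included node_poweron here; the resume side is essentially the
> mirror image. A minimal helper of that kind would look roughly like the
> following, assuming a node_poweron helper named by analogy with
> node_poweroff, the same xCAT rpower tooling, and the same log file:)
>
> #!/bin/bash
> set -o nounset
>
> NODE=${1}
>
> echo "$(date +'%F %T') power up ${NODE}" >> /var/log/slurm/powermgmt.log
>
> # Power the node back on via the BMC; Slurm treats it as resumed once
> # slurmd responds within ResumeTimeout
> rpower ${NODE} on
>
> exit 0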
>
>
>
>
>
