Hi all,
just to let you know the solution:
Standard CentOS sudo is configured to require a TTY ("Defaults requiretty"
in /etc/sudoers), which prevented my second script from being executed.
After changing this setting, SLURM's power management works as expected.
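
For anyone hitting the same problem, this is roughly the change (edit with
visudo; the narrower per-user form assumes the SuspendProgram runs as user
"slurm", as in my sudoers excerpt in the quoted mail below):

    # either comment out the global default ...
    #Defaults    requiretty
    # ... or lift the TTY requirement only for the slurm user:
    Defaults:slurm !requiretty
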
Regards,
Uwe
Am 01.09.2014 um 02:45 schrieb Franco Broi:
>
> This is my suspend script. Maybe your script is being called, but for
> some reason it isn't executing the do loop, so you don't see the
> messages in your power log.
>
> #!/bin/bash
> echo "`date` Suspend invoked $0 $*" >>/var/log/power_save.log
> hosts=`scontrol show hostnames $1`
> for host in $hosts
> do
>     echo sudo /share/system/bin/node_poweroff $host >>/var/log/power_save.log
>     sudo /share/system/bin/node_poweroff $host >>/var/log/power_save.log
> done
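>
> (The echo before the sudo call writes the exact command into the log, so
> you can tell whether the loop body ran at all, independent of whether
> the poweroff itself succeeded.)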
>
> On Fri, 2014-08-29 at 02:36 -0700, Uwe Sauter wrote:
>> Hi,
>>
>> thanks for the suggestion. Unfortunately, I have already set
>> SlurmctldDebug=9.
>>
>> A "grep -i power /var/log/slurm/slurmctld.log | tail" gives:
>>
>> [2014-08-29T09:10:05.202] Power save mode: 31 nodes
>> [2014-08-29T09:12:17.228] power_save: waking nodes n510301
>> [2014-08-29T09:15:56.267] power_save: waking nodes n510401
>> [2014-08-29T09:20:18.321] Power save mode: 29 nodes
>> [2014-08-29T09:23:23.359] power_save: waking nodes n511301
>> [2014-08-29T09:31:05.448] Power save mode: 28 nodes
>> [2014-08-29T09:41:45.535] Power save mode: 28 nodes
>> [2014-08-29T09:49:25.619] power_save: suspending nodes
>> n[511001,511101,511601]
>> [2014-08-29T09:52:07.648] Power save mode: 31 nodes
>> [2014-08-29T09:53:08.656] power_save: waking nodes n511001
>>
>> Taking nodes n[511001,511101,511601] as an example,
>> "scontrol show node $NODE | grep State" gives:
>> n511001: State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
>> n511101: State=IDLE+POWER ThreadsPerCore=2 TmpDisk=0 Weight=1
>> n511601: State=IDLE+POWER ThreadsPerCore=2 TmpDisk=0 Weight=1
>>
>> "ipmitool -Ilanplus -UADMIN -Pxxxxx -H $NODE-bmc power status"
>> n511001: Chassis Power is on
>> n511101: Chassis Power is on
>> n511601: Chassis Power is on
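>>
>> For reference, a loop along these lines produces the per-node output
>> above (a sketch; assumes scontrol is in $PATH):
>>
>>     for NODE in $(scontrol show hostnames "n[511001,511101,511601]"); do
>>         printf '%s: ' "${NODE}"
>>         scontrol show node "${NODE}" | grep State
>>     done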
>>
>> which indicates that SLURM tries to shut those nodes down but actually
>> fails, which is consistent with my suspicion that one of the scripts
>> isn't being executed.
>>
>> Executing my script manually successfully shuts down the node:
>>
>> # sudo -u slurm /opt/system/slurm/etc/node_poweroff.slurm n511601
>>
>> But after turning this node on again I get a status of
>> State=DOWN+POWER with Reason=Node unexpectedly rebooted
>> [slurm@2014-08-29T10:20:08]
>>
>> which seems odd, as SLURM should know that this node was without power
>> for some time.
>>
>>
>> From this situation I have two issues:
>>
>> 1) How can I verify that SLURM actually executes the configured scripts?
>> 2) Should I file a bug report for this "unexpected reboot" behavior? The
>>    reboot was not unexpected, as SLURM itself requested the shutdown.
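>>
>> For (1), one thing I could try is logging the invocation itself at the
>> very top of each wrapper, before anything that might fail (a sketch,
>> reusing my existing log file):
>>
>>     echo "$(date +'%F %T') $0 invoked with: $*" >> /var/log/slurm/powermgmt.log
>>
>> If that line never shows up, SLURM is not running the wrapper at all.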
>>
>> Regards,
>>
>> Uwe
>>
>> Am 28.08.2014 um 13:42 schrieb Franco Broi:
>>> We use power saving, so it definitely works. Maybe you should try
>>> turning on debugging for the controller daemon with scontrol and
>>> checking the log file.
>>>
>>> On 28 Aug 2014 19:18, Uwe Sauter <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> (configuration and scripts below text)
>>>
>>> I have configured SLURM to power down idle nodes, but it is probably
>>> misconfigured. I am aiming for a setup in which nodes that have been
>>> idle for a certain period (say, 10 minutes) are powered down.
>>>
>>> As you can see from the configuration below, I have SLURM call either
>>> "node_poweroff.slurm" or "node_poweron.slurm". These are wrapper
>>> scripts that expand SLURM's hostlist syntax (see the example below)
>>> and call "node_poweroff" or "node_poweron" for each node.
>>>
>>> "node_power{off,on}" log their actions into /var/log/slurm/powermgmt.log
>>> so I can follow and in the future analyze which nodes were turned off
>>> and on.
>>>
>>> The current situation is that although I see 36 out of 54 nodes in an
>>> IDLE+POWER state, all nodes are powered on and accessible via SSH.
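>>>
>>> As a quick cross-check from SLURM's side, sinfo marks nodes it
>>> believes are powered down with a trailing "~" in the state column:
>>>
>>>     $ sinfo -N -o "%N %t"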
>>>
>>> Output from "grep -i power /var/log/slurm/slurmctld.log | tail"
>>>
>>> [2014-08-28T12:01:24.975] Power save mode: 30 nodes
>>> [2014-08-28T12:11:44.080] Power save mode: 30 nodes
>>> [2014-08-28T12:22:44.194] Power save mode: 30 nodes
>>> [2014-08-28T12:33:44.306] Power save mode: 30 nodes
>>> [2014-08-28T12:44:01.425] Power save mode: 30 nodes
>>> [2014-08-28T12:51:44.514] power_save: suspending nodes
>>> n[510301,510601,511901]
>>> [2014-08-28T12:54:26.547] Power save mode: 33 nodes
>>> [2014-08-28T12:54:26.547] power_save: suspending nodes n[511101,512501]
>>> [2014-08-28T12:57:08.581] power_save: suspending nodes n510901
>>> [2014-08-28T13:05:10.666] Power save mode: 36 nodes
>>>
>>> Output from "tail /var/log/slurm/powermgmt.log"
>>>
>>> 2014-08-27 16:39:36 power on n512501
>>> 2014-08-27 16:51:17 power on n512601
>>> 2014-08-27 17:59:38 power on n512601
>>> 2014-08-28 09:05:54 power on n511101
>>> 2014-08-28 09:06:05 power on n511201
>>> 2014-08-28 09:06:11 power on n512001
>>> 2014-08-28 09:06:19 power on n512201
>>> 2014-08-28 10:41:51 power on n510501
>>> 2014-08-28 10:41:51 power on n510701
>>> 2014-08-28 11:31:41 power on n511101
>>>
>>> grep does not find "down" in /var/log/slurm/powermgmt.log, which it
>>> should if "node_poweroff" had been executed.
>>>
>>> My impression is that something (a misconfiguration? a bad sudo
>>> setup? a permissions problem?) is preventing SLURM from executing one
>>> of these scripts.
>>>
>>> Can someone check my configuration and give some advice on how to debug
>>> this issue further?
>>>
>>>
>>> Thank you,
>>>
>>> Uwe
>>>
>>>
>>> ### slurm.conf excerpt ###
>>>
>>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>>> SuspendTime=600        # seconds idle before a node is suspended (10 min)
>>> SuspendRate=30         # max nodes suspended per minute
>>> ResumeRate=10          # max nodes resumed per minute
>>> SuspendProgram=/opt/system/slurm/etc/node_poweroff.slurm
>>> ResumeProgram=/opt/system/slurm/etc/node_poweron.slurm
>>> SuspendTimeout=120     # seconds allowed for a node to power down
>>> ResumeTimeout=300      # seconds allowed for a node to come back up
>>> #SuspendExcNodes=n51[03,04,29,30][01],n52[04,05][01]
>>> #SuspendExcParts=
>>> BatchStartTimeout=60   # seconds to allow job launch while nodes resume
>>>
>>> ##########################
>>>
>>> ### /opt/system/slurm/etc/node_poweroff.slurm ###
>>>
>>> #!/bin/bash
>>> set -o nounset
>>>
>>> NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames $1)
>>>
>>> for NODE in ${NODES}; do
>>>     sudo /opt/system/slurm/etc/node_poweroff ${NODE}
>>> done
>>>
>>> exit 0
>>>
>>> #################################################
>>>
>>> ### /opt/system/slurm/etc/node_poweron.slurm ###
>>>
>>> #!/bin/bash
>>> set -o nounset
>>>
>>> NODES=$(/opt/system/slurm/default/bin/scontrol show hostnames $1)
>>>
>>> for NODE in ${NODES}; do
>>>     /opt/system/slurm/etc/node_poweron ${NODE}
>>> done
>>>
>>> #################################################
>>>
>>> ### /opt/system/slurm/etc/node_poweroff ###
>>>
>>> #!/bin/bash
>>> set -o nounset
>>>
>>> NODE=$1
>>>
>>> echo "$(date +'%F %T') power down ${NODE}" >> /var/log/slurm/powermgmt.log
>>>
>>> ssh ${NODE} "/etc/init.d/lustre_client stop"
>>> ssh ${NODE} "umount /localscratch /nfs/*"
>>> ssh ${NODE} "service slurm stop"
>>> ssh ${NODE} "service munge stop"
>>> ssh ${NODE} "poweroff"
>>>
>>> sleep 10
>>>
>>> # if the node still answers ping after 10 s, force it off via IPMI
>>> ping -c1 ${NODE} >/dev/null 2>&1 && \
>>>     /usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power off
>>>
>>> exit 0
>>>
>>> #############################################
>>>
>>> ### /opt/system/slurm/etc/node_poweron ###
>>>
>>> #!/bin/bash
>>> set -o nounset
>>>
>>> NODE=${1}
>>>
>>> echo "$(date +'%F %T') power on ${NODE}" >> /var/log/slurm/powermgmt.log
>>>
>>> /usr/bin/ipmitool -Ilanplus -UADMIN -Pxxxxx -H ${NODE}-bmc power on
>>>
>>> exit 0
>>>
>>>
>>> ##########################################
>>>
>>> ### /etc/sudoers excerpt ###
>>>
>>> slurm ALL=NOPASSWD: /opt/system/slurm/etc/node_poweron
>>> slurm ALL=NOPASSWD: /opt/system/slurm/etc/node_poweroff
>>>
>>> ############################
>>>