Hi Moe,

Just to confirm - we have upgraded the cluster from SLURM 2.1.16 to
SLURM 2.2.1, and the problem we saw with nodes holding pre-empted jobs
being power-saved is no longer occurring, so this bug is indeed fixed.

Many thanks,

-stephen

On 10/12/10 17:20, stephen mulcahy wrote:
Hi Moe,

That sounds good - we'll roll the cluster to v2.2 when it is released.

Unfortunately the cluster is back in production now, so I won't be able
to test this patch for some time. I'll let you know how it works when I
get a chance to test it.

Thanks,

-stephen

On 10/12/10 16:26, Jette, Moe wrote:
I am not planning to fix this in SLURM version 2.1, but the attached
patch
will be applied to version 2.2. It adds a count of suspended jobs to the
node's data structure and avoids powering down any node with suspended
jobs.
________________________________________
From: [email protected] [[email protected]]
On Behalf Of stephen mulcahy [[email protected]]
Sent: Thursday, December 09, 2010 8:27 AM
To: [email protected]
Subject: [slurm-dev] Preemption and Power Saving

Hi,

I have installed SLURM 2.1.16 and enabled both Preemption and Power
Saving (both of which are really interesting features for us - thanks to
you folks for developing them). Each one seems to work well in our
tests. However, when I combine them in the following scenario, I run
into a problem.

Start Job 1 (using 65 of our 70 compute nodes).

Start Job 2 in a higher priority partition (using 40 of our 70 compute
nodes).

Job 2 successfully pre-empts Job 1 and runs to completion. While Job 2
is running, 25 of the nodes allocated to the now-suspended Job 1 are
found to be idle and are powered down. This will adversely affect our
suspended job.

Is this the expected behaviour of SLURM? Is there any way for me to stop
the power management from shutting down nodes with suspended jobs?

My Pre-emption config:

# SCHEDULING
#DefMemPerCPU=0
#EnablePreemption=no
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
SelectTypeParameters=CR_Memory
DefMemPerNode=0
PreemptMode=SUSPEND,GANG
PreemptType=preempt/partition_prio


My power-saving config:

# POWER SAVE SUPPORT FOR IDLE NODES (optional)
SuspendProgram=/shared/slurm/power/slurmSuspend.sh
ResumeProgram=/shared/slurm/power/slurmResume.sh
SuspendTimeout=180
ResumeTimeout=300
ResumeRate=100
#SuspendExcNodes=
#SuspendExcParts=
SuspendRate=100
SuspendTime=300
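
(A possible stopgap until the fix lands, untested here: the
SuspendExcParts option commented out above excludes whole partitions
from power saving, so excluding the partition whose jobs get suspended
would keep those nodes up at the cost of losing power saving on them
entirely. The partition name below is a placeholder, not one from our
actual config:

# Keep nodes in the low-priority partition out of power saving.
# "batch" is a hypothetical partition name - substitute your own.
SuspendExcParts=batch

This trades power savings for safety, so it only makes sense while
suspended jobs can still be stranded on powered-down nodes.)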


Thanks,

-stephen

--
Stephen Mulcahy Atlantic Linux http://www.atlanticlinux.ie
Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway)




