2014-08-05 23:47 GMT+02:00 Satrajit Ghosh <[email protected]>:
> hi
>
> out cluster is setup with the configuration below. yet we have been having
> a lot of jobs cancelled when preempted:
>
> slurmd[node004]: *** JOB 79188 CANCELLED AT 2014-08-05T15:31:41 DUE TO
> PREEMPTION ***
> i thought the settings would simply suspend the job instead of canceling
> it.
>
> cheers,
>
> satra
>
> Partial configuration
> ---------------------------
>
> PreemptMode=GANG,SUSPEND
>
> PreemptType=preempt/partition_prio
>
> # default
>
> SchedulerTimeSlice=30
>
> DefMemPerCPU=2048
>
> DefMemPerNode=2048
>
> PartitionName=DEFAULT MaxTime=7-0 DefaultTime=24:00:00
>
> # Partitions
>
> PartitionName=defq Default=NO MinNodes=1 DefaultTime=1-00:00:00
> MaxTime=7-00:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO
> RootOnly=NO Hidden=YES Shared=NO GraceTime=0 ReqResv=NO
> PreemptMode=GANG,SUSPEND State=UP
>
> PartitionName=om_all_nodes Default=YES MinNodes=1 DefaultTime=1-00:00:00
> MaxTime=7-00:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO
> RootOnly=NO Hidden=NO Shared=FORCE:4 GraceTime=0 ReqResv=NO
> PreemptMode=GANG,SUSPEND State=UP Nodes=node[001-030]
>
> PartitionName=om_interactive Default=NO MinNodes=1 MaxNodes=1
> DefaultTime=01:00:00 MaxTime=01:00:00 AllowGroups=ALL Priority=10
> DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=FORCE:1 GraceTime=0
> MaxCPUsPerNode=32 ReqResv=NO PreemptMode=GANG,SUSPEND State=UP Nodes=node017
>

If I remember the logic correctly, it will try to suspend your job, but if
the plugin (proctrack?) fails to suspend it, the job will be killed.
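
For reference, the process-tracking plugin is selected with ProctrackType in
slurm.conf, which is not part of the partial configuration quoted above. A
rough sketch of the relevant line follows; the value shown is only an example
and an assumption on my side, not something taken from your cluster. As far
as I recall, suspend goes through the freezer with proctrack/cgroup and
through SIGSTOP with the other plugins, but please treat that as unverified.

```
# slurm.conf fragment (illustrative only, not from the cluster above)
# Common values:
#   proctrack/cgroup    - tracks job processes with the freezer cgroup
#   proctrack/linuxproc - tracks job processes via parent PIDs in /proc
#   proctrack/pgid      - tracks job processes by process group ID
ProctrackType=proctrack/cgroup
```

The value the controller is actually running with is listed in the output of
"scontrol show config".
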
Are you using the cgroups freezer or SIGSTOP to suspend your jobs?

Hope this can help,
marcin
