Hi,

Thanks for the answer, but it raises another question. Usually, when I
modify something in SLURM, I run reconfigure for the changes to take
effect; however, you say that I should not even lose running jobs. How does
that work? More generally, is it possible to change the configuration, for
example to create new partitions or change partition settings, without
destroying jobs that are actually running?
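
For example, is something like the following safe while jobs are running?
(The partition name and settings below are purely illustrative, not a real
partition on our cluster.)

```shell
# Append a hypothetical new partition to slurm.conf:
cat >> /etc/slurm/slurm.conf <<'EOF'
PartitionName=ib-low Priority=1 Nodes=node-ib-[1-46] Default=NO Shared=FORCE:1 MaxTime=INFINITE State=UP AllowGroups=ALL
EOF
# Then ask the running daemons to re-read the configuration:
scontrol reconfigure
```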

Thanks,
Daniel

2011/3/9 Jette, Moe <[email protected]>

> Daniel,
>
> You should not even lose any running jobs from the upgrade.
> The only possible problem is any of your applications that link
> directly to libslurm (i.e. directly use the SLURM APIs).
>
> Moe
>
>
> From: [email protected] [[email protected]] On
> Behalf Of Daniel Adriano Silva M [[email protected]]
> Sent: Tuesday, March 08, 2011 8:48 PM
> To: [email protected]
> Subject: Re: [slurm-dev] Fwd: Problem with suspend (Preemption mode),
> requesting help with the configuration.
>
> Moe,
>
> Thanks, I will compile the new version and let you know whether it fixes
> the problem. BTW, I have a second question: if I change the SLURM version,
> will it be necessary to recompile the programs that use srun as their MPI
> launcher?
>
> Daniel
>
> 2011/3/8 Jette, Moe <[email protected]>
> Hi Daniel,
>
> SLURM v2.2 contains quite a few bug fixes related to job preemption and
> gang scheduling.
> It will almost certainly fix this problem, but if it does not then we'll
> work with you on a fix.
>
> Moe
>
> ________________________________________
> From: [email protected]<mailto:[email protected]>
> [[email protected]<mailto:[email protected]>] On
> Behalf Of Daniel Adriano Silva M [[email protected]<mailto:
> [email protected]>]
> Sent: Monday, March 07, 2011 10:56 PM
> To: slurm-dev
> Subject: [slurm-dev] Fwd: Problem with suspend (Preemption mode),
> requesting help with the configuration.
>
> Hi Devs,
>
> Sorry to bother you with this small problem, but I have not been able to
> figure out what is happening here. We are running a small cluster at HKUST,
> and I tried to enable preemption so that high-priority jobs suspend
> low-priority ones. I had previously set this up using an older version of
> the program, but in 2.1.14 I see the following annoying behavior:
>
>  When I submit the higher-priority job, it does indeed suspend the
> lower-priority one. But later, after about 60 seconds (matching the
> SchedulerTimeSlice), both jobs end up running: the low-priority job
> alternates between suspended and running at 60-second intervals, while the
> high-priority job runs the whole time, so every 60 seconds both jobs
> overlap. It looks as if some kind of gang scheduling is still active, even
> though I have the max share set to 1 (Shared=FORCE:1), which in theory
> should disable gang scheduling and simply preempt low-priority jobs. What
> is worse, this is a broken form of gang scheduling, since the jobs overlap
> while both are in a running state. More interesting still, squeue shows
> the low-priority job as suspended the whole time, yet on the compute node
> it alternates between running and suspended as described. The behavior I
> want is for the low-priority job to stay suspended until the high-priority
> job finishes. Could you help me?
>
>
> Thanks,
> Daniel Silva
>
> AccountingStorageBackupHost = (null)
> AccountingStorageEnforce = associations,limits
> AccountingStorageHost   = compbio0
> AccountingStorageLoc    = N/A
> AccountingStoragePass   = (null)
> AccountingStoragePort   = 7031
> AccountingStorageType   = accounting_storage/slurmdbd
> AccountingStorageUser   = N/A
> AuthType                = auth/munge
> BackupAddr              = (null)
> BackupController        = (null)
> BatchStartTimeout       = 10 sec
> BOOT_TIME               = 2010-11-08T20:28:55
> CacheGroups             = 0
> CheckpointType          = checkpoint/none
> ClusterName             = bbrc
> CompleteWait            = 0 sec
> ControlAddr             = 10.1.1.1
> ControlMachine          = compbio
> CryptoType              = crypto/munge
> DebugFlags              = (null)
> DefMemPerCPU            = 500
> DisableRootJobs         = NO
> EnforcePartLimits       = NO
> Epilog                  = (null)
> EpilogMsgTime           = 2000 usec
> EpilogSlurmctld         = (null)
> FastSchedule            = 1
> FirstJobId              = 1
> GetEnvTimeout           = 2 sec
> HealthCheckInterval     = 0 sec
> HealthCheckProgram      = (null)
> InactiveLimit           = 0 sec
> JobAcctGatherFrequency  = 30 sec
> JobAcctGatherType       = jobacct_gather/none
> JobCheckpointDir        = /var/slurm/checkpoint
> JobCompHost             = localhost
> JobCompLoc              = /var/log/slurm_jobcomp.log
> JobCompPass             = (null)
> JobCompPort             = 0
> JobCompType             = jobcomp/none
> JobCompUser             = root
> JobCredentialPrivateKey = /etc/slurm/slurm.key
> JobCredentialPublicCertificate = /etc/slurm/slurm.cert
> JobFileAppend           = 0
> JobRequeue              = 1
> KillOnBadExit           = 0
> KillWait                = 30 sec
> Licenses                = (null)
> MailProg                = /bin/mail
> MaxJobCount             = 5000
> MaxMemPerCPU            = 2000
> MaxTasksPerNode         = 128
> MessageTimeout          = 10 sec
> MinJobAge               = 300 sec
> MpiDefault              = none
> MpiParams               = (null)
> NEXT_JOB_ID             = 18406
> OverTimeLimit           = 0 min
> PluginDir               = /usr/lib64/slurm
> PlugStackConfig         = /etc/slurm/plugstack.conf
> PreemptMode             = GANG,SUSPEND
> PreemptType             = preempt/partition_prio
> PriorityType            = priority/basic
> PrivateData             = none
> ProctrackType           = proctrack/pgid
> Prolog                  = (null)
> PrologSlurmctld         = (null)
> PropagatePrioProcess    = 0
> PropagateResourceLimits = ALL
> PropagateResourceLimitsExcept = (null)
> ResumeProgram           = (null)
> ResumeRate              = 300 nodes/min
> ResumeTimeout           = 60 sec
> ResvOverRun             = 0 min
> ReturnToService         = 2
> SallocDefaultCommand    = (null)
> SchedulerParameters     = (null)
> SchedulerPort           = 7321
> SchedulerRootFilter     = 1
> SchedulerTimeSlice      = 60 sec
> SchedulerType           = sched/backfill
> SelectType              = select/cons_res
> SelectTypeParameters    = CR_CORE_MEMORY
> SlurmUser               = slurm(504)
> SlurmctldDebug          = 3
> SlurmctldLogFile        = (null)
> SlurmctldPidFile        = /var/run/slurmctld.pid
> SlurmctldPort           = 6817
> SlurmctldTimeout        = 300 sec
> SlurmdDebug             = 3
> SlurmdLogFile           = (null)
> SlurmdPidFile           = /var/run/slurmd.pid
> SlurmdPort              = 6818
> SlurmdSpoolDir          = /tmp/slurmd
> SlurmdTimeout           = 300 sec
> SlurmdUser              = root(0)
> SLURM_CONF              = /etc/slurm/slurm.conf
> SLURM_VERSION           = 2.1.14
> SrunEpilog              = (null)
> SrunProlog              = (null)
> StateSaveLocation       = /tmp
> SuspendExcNodes         = (null)
> SuspendExcParts         = (null)
> SuspendProgram          = (null)
> SuspendRate             = 60 nodes/min
> SuspendTime             = NONE
> SuspendTimeout          = 30 sec
> SwitchType              = switch/none
> TaskEpilog              = (null)
> TaskPlugin              = task/affinity
> TaskPluginParam         = (null type)
> TaskProlog              = (null)
> TmpFS                   = /tmp
> TopologyPlugin          = topology/none
> TrackWCKey              = 0
> TreeWidth               = 50
> UsePam                  = 0
> UnkillableStepProgram   = (null)
> UnkillableStepTimeout   = 60 sec
> WaitTime                = 0 sec
>
> Slurmctld(primary/backup) at compbio/(NULL) are UP/DOWN
>
> PartitionName=ib-high           Priority=5 Nodes=node-ib-[1-46]
> Default=NO  Shared=FORCE:1  MaxTime=INFINITE State=UP AllowGroups=chemHuang
> PartitionName=ib-default        Priority=3 Nodes=node-ib-[1-46]
> Default=YES Shared=FORCE:1  MaxTime=INFINITE State=UP AllowGroups=ALL
>
