Yes. Yes, for all job types. No; the node's real memory is defined elsewhere in slurm.conf.
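For reference, a minimal slurm.conf sketch of how those two settings differ; the node name and memory sizes below are made up for illustration, not taken from this cluster:

# DefMemPerCPU is the default *requested* memory per allocated CPU, in MB,
# applied to jobs that do not pass --mem or --mem-per-cpu themselves.
DefMemPerCPU=2048                  # 2048 MB = 2 GB per CPU

# A node's real (physical) memory is declared on its NodeName line, also in MB.
# "node01" and these sizes are hypothetical.
NodeName=node01 CPUs=16 RealMemory=64000 State=UNKNOWN

With select/cons_res and CR_Core_Memory, memory is tracked against RealMemory, so each job's request (explicit or the DefMemPerCPU default) is what determines whether both the preemptor and the suspended job fit on the node.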
On March 10, 2014 6:59:38 PM PDT, "Ryan M. Bergmann" <[email protected]> wrote:
>
> The partitions are exactly the same except for their priorities.
>
> I think the problem is due to memory. Am I correct in assuming that
> SLURM will not suspend a job if it thinks the node does not have enough
> memory to hold both jobs (the preempter and the preemptee)?
>
> One other question: in slurm.conf, is DefMemPerCPU/Node the default
> *requested* value for batch jobs which do not explicitly specify it? Or
> is this supposed to be the actual memory the node has (which I thought
> was defined in the partition)?
>
> Thanks again!
>
> -Ryan
>
>
> On Mar 10, 2014, at 6:53 PM, Moe Jette <[email protected]> wrote:
>
>> Are the jobs allocated different nodes, or even different cores on
>> the nodes? If so, they don't need to preempt each other. Also see:
>> http://slurm.schedmd.com/preempt.html
>>
>> Quoting "Ryan M. Bergmann" <[email protected]>:
>>
>>> Hi dev list,
>>>
>>> I’m having trouble getting preemption to work in suspend mode. I
>>> have three partitions: one with a priority of 100, one with a priority
>>> of 3, and one with a priority of 1. When users submit jobs on the
>>> 100-priority partition, a job on a lower-priority partition is not
>>> suspended. I have SHARED=FORCE:1 for all partitions. Any ideas what
>>> could be happening? The default memory per CPU is 2 GB, so if users do
>>> not explicitly specify their memory usage, will preemption fail if the
>>> high-priority job needs more memory than what is free on the node?
>>>
>>> Thanks!
>>>
>>> Ryan
>>>
>>>
>>> My configuration is:
>>>
>>> AccountingStorageBackupHost = (null)
>>> AccountingStorageEnforce = associations,limits
>>> AccountingStorageHost = berkelium.berkelium
>>> AccountingStorageLoc = N/A
>>> AccountingStoragePort = 6819
>>> AccountingStorageType = accounting_storage/slurmdbd
>>> AccountingStorageUser = N/A
>>> AccountingStoreJobComment = YES
>>> AuthType = auth/munge
>>> BackupAddr = (null)
>>> BackupController = (null)
>>> BatchStartTimeout = 10 sec
>>> BOOT_TIME = 2014-03-10T18:06:50
>>> CacheGroups = 0
>>> CheckpointType = checkpoint/none
>>> ClusterName = cluster
>>> CompleteWait = 0 sec
>>> ControlAddr = berkelium
>>> ControlMachine = berkelium
>>> CryptoType = crypto/munge
>>> DebugFlags = (null)
>>> DefMemPerCPU = 2
>>> DisableRootJobs = NO
>>> EnforcePartLimits = NO
>>> Epilog = (null)
>>> EpilogMsgTime = 2000 usec
>>> EpilogSlurmctld = (null)
>>> FastSchedule = 1
>>> FirstJobId = 1
>>> GetEnvTimeout = 2 sec
>>> GresTypes = (null)
>>> GroupUpdateForce = 0
>>> GroupUpdateTime = 600 sec
>>> HASH_VAL = Match
>>> HealthCheckInterval = 0 sec
>>> HealthCheckProgram = (null)
>>> InactiveLimit = 0 sec
>>> JobAcctGatherFrequency = 30 sec
>>> JobAcctGatherType = jobacct_gather/none
>>> JobCheckpointDir = /var/slurm/checkpoint
>>> JobCompHost = localhost
>>> JobCompLoc = /var/log/slurm_jobcomp.log
>>> JobCompPort = 0
>>> JobCompType = jobcomp/filetxt
>>> JobCompUser = root
>>> JobCredentialPrivateKey = (null)
>>> JobCredentialPublicCertificate = (null)
>>> JobFileAppend = 0
>>> JobRequeue = 1
>>> JobSubmitPlugins = (null)
>>> KillOnBadExit = 0
>>> KillWait = 30 sec
>>> Licenses = (null)
>>> MailProg = /bin/mail
>>> MaxJobCount = 10000
>>> MaxJobId = 4294901760
>>> MaxMemPerNode = UNLIMITED
>>> MaxStepCount = 40000
>>> MaxTasksPerNode = 128
>>> MessageTimeout = 10 sec
>>> MinJobAge = 300 sec
>>> MpiDefault = none
>>> MpiParams = (null)
>>> NEXT_JOB_ID = 11897
>>> OverTimeLimit = 0 min
>>> PluginDir = /usr/lib64/slurm
>>> PlugStackConfig = /etc/slurm/plugstack.conf
>>> PreemptMode = GANG,SUSPEND
>>> PreemptType = preempt/partition_prio
>>> PriorityDecayHalfLife = 00:07:00
>>> PriorityCalcPeriod = 00:05:00
>>> PriorityFavorSmall = 0
>>> PriorityFlags = 0
>>> PriorityMaxAge = 7-00:00:00
>>> PriorityUsageResetPeriod = NONE
>>> PriorityType = priority/multifactor
>>> PriorityWeightAge = 1
>>> PriorityWeightFairShare = 5
>>> PriorityWeightJobSize = 0
>>> PriorityWeightPartition = 100
>>> PriorityWeightQOS = 5
>>> PrivateData = none
>>> ProctrackType = proctrack/pgid
>>> Prolog = (null)
>>> PrologSlurmctld = (null)
>>> PropagatePrioProcess = 0
>>> PropagateResourceLimits = ALL
>>> PropagateResourceLimitsExcept = (null)
>>> RebootProgram = (null)
>>> ReconfigFlags = (null)
>>> ResumeProgram = (null)
>>> ResumeRate = 300 nodes/min
>>> ResumeTimeout = 60 sec
>>> ResvOverRun = 0 min
>>> ReturnToService = 1
>>> SallocDefaultCommand = (null)
>>> SchedulerParameters = (null)
>>> SchedulerPort = 7321
>>> SchedulerRootFilter = 1
>>> SchedulerTimeSlice = 30 sec
>>> SchedulerType = sched/backfill
>>> SelectType = select/cons_res
>>> SelectTypeParameters = CR_CORE_MEMORY
>>> SlurmUser = slurm(202)
>>> SlurmctldDebug = info
>>> SlurmctldLogFile = (null)
>>> SlurmSchedLogFile = (null)
>>> SlurmctldPort = 6817
>>> SlurmctldTimeout = 120 sec
>>> SlurmdDebug = info
>>> SlurmdLogFile = (null)
>>> SlurmdPidFile = /var/run/slurmd.pid
>>> SlurmdPort = 6818
>>> SlurmdSpoolDir = /var/spool/slurmd
>>> SlurmdTimeout = 300 sec
>>> SlurmdUser = root(0)
>>> SlurmSchedLogLevel = 0
>>> SlurmctldPidFile = /var/run/slurmctld.pid
>>> SLURM_CONF = /etc/slurm/slurm.conf
>>> SLURM_VERSION = 2.4.3
>>> SrunEpilog = (null)
>>> SrunProlog = (null)
>>> StateSaveLocation = /var/spool/slurmsave
>>> SuspendExcNodes = (null)
>>> SuspendExcParts = (null)
>>> SuspendProgram = (null)
>>> SuspendRate = 60 nodes/min
>>> SuspendTime = NONE
>>> SuspendTimeout = 30 sec
>>> SwitchType = switch/none
>>> TaskEpilog = (null)
>>> TaskPlugin = task/affinity
>>> TaskPluginParam = (null type)
>>> TaskProlog = (null)
>>> TmpFS = /tmp
>>> TopologyPlugin = topology/none
>>> TrackWCKey = 0
>>> TreeWidth = 50
>>> UsePam = 0
>>> UnkillableStepProgram = (null)
>>> UnkillableStepTimeout = 60 sec
>>> VSizeFactor = 0 percent
>>> WaitTime = 0 sec
>>>
>>
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
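To make the suspend setup concrete, here is a minimal sketch of the partition side of such a configuration (preempt/partition_prio with gang/suspend, as in the posted config); the partition and node names are hypothetical:

# Cluster-wide settings, matching the posted configuration
PreemptType=preempt/partition_prio
PreemptMode=GANG,SUSPEND
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Hypothetical partitions with the three priorities described above
PartitionName=high Nodes=node[01-04] Priority=100 Shared=FORCE:1 State=UP
PartitionName=mid  Nodes=node[01-04] Priority=3   Shared=FORCE:1 State=UP
PartitionName=low  Nodes=node[01-04] Priority=1   Shared=FORCE:1 Default=YES State=UP

Because a suspended job keeps its memory allocation, a high-priority job can only suspend a lower-priority one if the node's configured memory covers both, so submissions that state their memory explicitly (for example "sbatch -p high --mem=4000 job.sh", value in MB) are easier to place than ones that fall back on the DefMemPerCPU default.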
